We study the approximation of arbitrary distributions $P$ on
$d$-dimensional space by distributions with log-concave density.
Approximation means minimizing a Kullback-Leibler-type functional. We
show that such an approximation exists if and only if $P$ has finite
first moments and is not supported by some hyperplane. Furthermore we
show that this approximation depends continuously on $P$ with respect to
Mallows distance $D_1(\cdot,\cdot)$. This result implies consistency of
the maximum likelihood estimator of a log-concave density under fairly
general conditions. It also allows us to prove existence and consistency
of estimators in regression models with a response
$Y = \mu(X) + \epsilon$, where $X$ and $\epsilon$ are independent,
$\mu(\cdot)$ belongs to a certain class of regression functions while
$\epsilon$ is a random error with log-concave density and mean zero.



1. Introduction. Log-concave distributions, that is, distributions with 
a Lebesgue density the logarithm of which is concave, are an interesting 
nonparametric model comprising many parametric families of distributions. 
Bagnoli and Bergstrom (2005) give an overview of many interesting prop- 
erties and applications in econometrics. Indeed, these distributions have re- 
ceived a lot of attention among statisticians recently as described in the 
review by Walther (2009). The nonparametric maximum likelihood estimator
was studied in the univariate setting by Pal, Woodroofe and Meyer
(2007), Rufibach (2006), Dümbgen, Hüsler and Rufibach (2007), Balabdaoui,
Rufibach and Wellner (2009) and Dümbgen and Rufibach (2009). These
references contain characterizations of the estimators, consistency
results and explicit algorithms. Extensions of one or more of these
aspects to the multivariate setting are presented by Cule, Samworth and
Stewart (2010), Cule and Samworth (2010), Koenker and Mizera (2010),
Seregin and Wellner (2010) and Schuhmacher and Dümbgen (2010). Both Cule
and Samworth (2010) and Schuhmacher, Hüsler and Dümbgen (2009) show that
multivariate log-concave distributions are a very well-behaved
nonparametric class.
For instance, moments of arbitrary order are continuous statistical function- 
als with respect to weak convergence. 

The first aim of the present paper is a deeper understanding of the
approximation scheme underlying the maximum likelihood estimator of a
log-concave density. Let us put this into a somewhat broader context: let
$\hat Q_n$ be the empirical distribution of independent random vectors
$X_1, X_2, \ldots, X_n$ with distribution $Q$ on a given open set
$\mathcal{X} \subset \mathbb{R}^d$. Suppose that $Q$ has a density $f$
belonging to a given class $\mathcal{F}$ of probability densities on
$\mathcal{X}$. The maximum likelihood estimator of $f$ may be written as

    $\hat f_n := \arg\max_{f \in \mathcal{F}} \int \log(f)\,d\hat Q_n$

(provided this exists and is unique). Even if $Q$ fails to have a density
within $\mathcal{F}$, one may view $\hat f_n$ as an estimator of the
approximating density

    $f(\cdot \mid Q) := \arg\max_{f \in \mathcal{F}} \int \log(f)\,dQ.$

In fact, if $Q$ has a density $g \notin \mathcal{F}$ on $\mathcal{X}$ such
that the integral $\int g(x) \log g(x)\,dx$ exists in $\mathbb{R}$, one may
rewrite $f(\cdot \mid Q)$ as the minimizer of the Kullback-Leibler
divergence,

    $D_{KL}(f, g) := \int \log(g/f)(x)\, g(x)\,dx,$

over all $f \in \mathcal{F}$. Note the well-known fact that
$D_{KL}(f, g) > 0$ unless $f = g$ almost everywhere. Viewing a maximum
likelihood estimator as an estimator of an approximation within a given
model is common in statistics [see, e.g., Pfanzagl (1990), Patilea
(2001), Doksum et al. (2007) and Cule and Samworth (2010)]. Pfanzagl
(1990) and Patilea (2001) show that under suitable regularity conditions
on $Q$ and $\mathcal{F}$, the estimator $\hat f_n$ is consistent with
certain large deviation bounds or rates of convergence, even in the case
of misspecified models. To the best of our knowledge, their results are
not directly applicable in the setting of log-concave densities, which is
treated by Cule and Samworth (2010). Our ambition is to identify the
largest possible class of distributions $Q$ such that $f(\cdot \mid Q)$ is
well defined and unique. Moreover, we want to show that the mapping
$Q \mapsto f(\cdot \mid Q)$ is continuous on that class with respect to a
coarse topology, ideally the topology of weak convergence.

With these goals in mind, let us tell a short success story about
Grenander's estimator [Grenander (1956)], also a key example of Patilea
(2001): let $\mathcal{F}_{mon}$ be the class of all nonincreasing and
left-continuous probability densities on $\mathcal{X} = (0, \infty)$. Then
for any distribution $Q$ on $(0, \infty)$, the maximizer

    $f_{mon}(\cdot \mid Q) := \arg\max_{f \in \mathcal{F}_{mon}} \int_{(0,\infty)} \log f(x)\,Q(dx)$

is well defined and unique. Namely, if $G$ denotes the distribution
function of $Q$, then $f_{mon}(\cdot \mid Q)$ is the left-sided derivative
of the smallest concave majorant of $G$ on $(0, \infty)$ [see Barlow et al.
(1972)]. With this characterization one can show that for any sequence of
distributions $Q_n$ on $(0, \infty)$ converging weakly to $Q$,

    $\int_{(0,\infty)} |f_{mon}(x \mid Q_n) - f_{mon}(x \mid Q)|\,dx \to 0 \quad (n \to \infty).$

Since the sequence of empirical measures $\hat Q_n$ converges weakly to $Q$
almost surely, this entails strong consistency of the Grenander estimator
$f_{mon}(\cdot \mid \hat Q_n)$ in total variation distance.
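
The characterization of Grenander's estimator as the left-sided derivative
of the smallest concave majorant of the distribution function translates
into a few lines of code for an empirical distribution. The following is a
minimal sketch, assuming a sample of positive values; the function name
`grenander` and the hull-based implementation are our own illustration and
are not taken from the references.

```python
import numpy as np

def grenander(sample):
    """Left derivative of the least concave majorant of the empirical c.d.f.

    Returns the knots of the majorant and the density value on each
    interval (knots[j], knots[j+1]]. Assumes all sample values are > 0.
    """
    x = np.sort(np.asarray(sample, dtype=float))
    n = len(x)
    # Vertices (0, 0), (x_(1), 1/n), ..., (x_(n), 1) of the empirical c.d.f.
    px = np.concatenate(([0.0], x))
    py = np.arange(n + 1) / n
    hull = [0]  # indices of the vertices of the upper convex hull
    for i in range(1, n + 1):
        # Drop the last vertex while it lies on or below the chord
        # from its predecessor to the new point.
        while len(hull) >= 2:
            a, b = hull[-2], hull[-1]
            cross = ((px[b] - px[a]) * (py[i] - py[a])
                     - (py[b] - py[a]) * (px[i] - px[a]))
            if cross >= 0:
                hull.pop()
            else:
                break
        hull.append(i)
    knots = px[hull]
    slopes = np.diff(py[hull]) / np.diff(knots)
    return knots, slopes
```

The returned slopes are nonincreasing by construction, and the resulting
step function integrates to one because the majorant runs from (0, 0) to
the largest observation at height one.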

In the remainder of the present paper we consider the class $\mathcal{F}$
of log-concave probability densities on $\mathcal{X} = \mathbb{R}^d$. We
will show in Section 2 that $f(\cdot \mid Q)$ exists and is unique in
$L^1(\mathbb{R}^d)$ if and only if

    $\int \|x\|\,Q(dx) < \infty$

and

    $Q(H) < 1$ for any hyperplane $H \subset \mathbb{R}^d$.

Some additional properties of $f(\cdot \mid Q)$ will be established as
well. We show that the mapping $Q \mapsto f(\cdot \mid Q)$ is continuous
with respect to Mallows distance [Mallows (1972)] $D_1(\cdot,\cdot)$, also
known as a Wasserstein, Monge-Kantorovich or Earth Mover's distance.
Precisely, let $Q$ satisfy the properties just mentioned, and let
$(Q_n)_n$ be a sequence of probability distributions converging to $Q$ in
$D_1$; in other words,

    (1)  $Q_n \to_w Q$ and $\int \|x\|\,Q_n(dx) \to \int \|x\|\,Q(dx)$

as $n \to \infty$. Then $f(\cdot \mid Q_n)$ is well defined for
sufficiently large $n$ and

    $\lim_{n \to \infty} \int |f(x \mid Q_n) - f(x \mid Q)|\,dx = 0.$

This entails strong consistency of the maximum likelihood estimator
$\hat f_n$, because $(\hat Q_n)_n$ converges almost surely to $Q$ with
respect to Mallows distance $D_1(\cdot,\cdot)$. In addition we show that
$Q \mapsto \max_{f \in \mathcal{F}} \int \log(f)\,dQ$ is convex and
upper semicontinuous with respect to weak convergence.

In Section 3 we apply these results to the following type of regression
problem: suppose that we observe independent real random variables
$Y_1, Y_2, \ldots, Y_n$ such that

    $Y_i = \mu(x_i) + \epsilon_i$

for given fixed design points $x_1, x_2, \ldots, x_n$ in some set
$\mathcal{X}$, some unknown regression function
$\mu : \mathcal{X} \to \mathbb{R}$ and independent, identically distributed
random errors $\epsilon_i$ with unknown log-concave density $f$ and mean
zero. We will show that a maximum likelihood estimator of $(\mu, f)$ exists
and is consistent under certain regularity conditions in the following two
cases: (i) $\mathcal{X} = \mathbb{R}^q$ and $\mu$ is affine (i.e., affine
linear); (ii) $\mathcal{X} = \mathbb{R}$ and $\mu$ is nondecreasing. These
methods are illustrated with a real data set.

Many proofs and technical arguments are deferred to Section 4. A longer 
and more detailed version of this paper is the technical report by Dümbgen,
Samworth and Schuhmacher (2010), referred to as [DSS 2010] hereafter. It 
contains all proofs, additional examples and plots, a detailed description 
of our algorithms and extensive simulation studies. There we also indicate 
potential applications to change-point analyses. 



2. Log-concave approximations. For a fixed dimension $d \in \mathbb{N}$,
let $\Phi = \Phi(d)$ be the family of concave functions
$\phi : \mathbb{R}^d \to [-\infty, \infty)$ which are upper semicontinuous
and coercive in the sense that

    $\phi(x) \to -\infty$ as $\|x\| \to \infty$.

In particular, for any $\phi \in \Phi$ there exist constants $a$ and
$b > 0$ such that $\phi(x) \le a - b\|x\|$, so $\int e^{\phi(x)}\,dx$ is
finite. Further let $\mathcal{Q} = \mathcal{Q}(d)$ be the family of all
probability distributions $Q$ on $\mathbb{R}^d$. Then we define a
log-likelihood-type functional

    $L(\phi, Q) := \int \phi\,dQ - \int e^{\phi(x)}\,dx + 1$

and a profile log-likelihood

    $L(Q) := \sup_{\phi \in \Phi} L(\phi, Q).$

If, for fixed $Q$, there exists a function $\psi \in \Phi$ such that
$L(\psi, Q) = L(Q) \in \mathbb{R}$, then it will automatically satisfy

    $\int e^{\psi(x)}\,dx = 1.$

To verify this, note that $\phi + c \in \Phi$ for any fixed function
$\phi \in \Phi$ and arbitrary $c \in \mathbb{R}$, and

    $\frac{\partial}{\partial c} L(\phi + c, Q) = 1 - e^{c} \int e^{\phi(x)}\,dx$

if $L(\phi, Q) \in \mathbb{R}$. Since this derivative is decreasing in $c$,
$L(\phi + c, Q)$ is maximal for $c = -\log \int e^{\phi(x)}\,dx$.
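
The normalization argument can be checked numerically. The sketch below is
an illustration under our own assumptions: $Q$ is a small empirical
distribution, $\phi(x) = -|x|$ is an unnormalized concave function, and the
integral is approximated by the trapezoidal rule on a wide grid. It scans
$c \mapsto L(\phi + c, Q)$ and compares the best value of $c$ with
$-\log \int e^{\phi(x)}\,dx$.

```python
import numpy as np

def trapezoid(values, grid):
    """Trapezoidal rule on a (possibly non-uniform) grid."""
    return float(np.sum((values[1:] + values[:-1]) * np.diff(grid)) / 2.0)

def L_shifted(phi_on_grid, grid, phi_at_sample, c):
    """L(phi + c, Q) for an empirical distribution Q with atoms at the sample."""
    integral = trapezoid(np.exp(phi_on_grid), grid)
    return float(np.mean(phi_at_sample) + c - np.exp(c) * integral + 1.0)

# Hypothetical example: phi(x) = -|x|, Q uniform on four points.
grid = np.linspace(-40.0, 40.0, 80001)
sample = np.array([-1.0, 0.2, 0.5, 2.0])
phi = lambda x: -np.abs(x)

mass = trapezoid(np.exp(phi(grid)), grid)   # integral of exp(phi), close to 2
c_star = -np.log(mass)                      # claimed maximizer of c -> L(phi + c, Q)
cs = np.linspace(c_star - 1.0, c_star + 1.0, 201)
values = [L_shifted(phi(grid), grid, phi(sample), c) for c in cs]
c_best = cs[int(np.argmax(values))]
```

On this grid `c_best` coincides with `c_star` up to rounding, in line with
the derivative computed above.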



2.1. Existence, uniqueness and basic properties. The next theorem
provides a complete characterization of all distributions
$Q \in \mathcal{Q}$ with real profile log-likelihood $L(Q)$. To state the
result we first define the convex support of a distribution
$Q \in \mathcal{Q}$ and collect some of its properties.

Lemma 2.1 (DSS 2010). For any $Q \in \mathcal{Q}$, the set

    $csupp(Q) := \bigcap \{C : C \subset \mathbb{R}^d$ closed and convex, $Q(C) = 1\}$

is itself closed and convex with $Q(csupp(Q)) = 1$. The following three
properties of $Q$ are equivalent:

(a) $csupp(Q)$ has nonempty interior;

(b) $Q(H) < 1$ for any hyperplane $H \subset \mathbb{R}^d$;

(c) with Leb denoting Lebesgue measure on $\mathbb{R}^d$,

    $\lim_{\delta \downarrow 0} \sup\{Q(C) : C \subset \mathbb{R}^d$ closed and convex, $Leb(C) \le \delta\} < 1.$

Theorem 2.2. For any $Q \in \mathcal{Q}$, the value of $L(Q)$ is real if
and only if

    $\int \|x\|\,Q(dx) < \infty$ and $interior(csupp(Q)) \ne \emptyset$.

In that case, there exists a unique function

    $\psi = \psi(\cdot \mid Q) \in \arg\max_{\phi \in \Phi} L(\phi, Q).$

This function $\psi$ satisfies $\int e^{\psi(x)}\,dx = 1$ and

    $interior(csupp(Q)) \subset dom(\psi) := \{x \in \mathbb{R}^d : \psi(x) > -\infty\} \subset csupp(Q).$

Remark 2.3 [Moment (in)equalities]. Let $Q \in \mathcal{Q}$ satisfy the
properties stated in Theorem 2.2. Then the log-density
$\psi = \psi(\cdot \mid Q)$ satisfies the following requirements:
$L(\psi, Q) = \int \psi\,dQ \in \mathbb{R}$, and for any function
$\Delta : \mathbb{R}^d \to \mathbb{R}$,

    (2)  $\int \Delta\,dQ \le \int \Delta(x) e^{\psi(x)}\,dx$ if $\psi + t\Delta \in \Phi$ for some $t > 0$.

This follows from

    $\lim_{t \downarrow 0} t^{-1} (L(\psi + t\Delta, Q) - L(\psi, Q)) = \int \Delta\,dQ - \int \Delta(x) e^{\psi(x)}\,dx.$

Let $P$ be the approximating probability measure with
$P(dx) = e^{\psi(x)}\,dx$. It satisfies the following (in)equalities:

    (3)  $\int h\,dP \le \int h\,dQ$ for any convex $h : \mathbb{R}^d \to (-\infty, \infty]$,

    (4)  $\int x\,P(dx) = \int x\,Q(dx).$

To verify (3), let $v \in \mathbb{R}^d$ be a subgradient of $h$ at 0, that
is, $h(x) \ge h(0) + v^\top x$ for all $x \in \mathbb{R}^d$. Since
$\psi(x) \le a - b\|x\|$ for arbitrary $x \in \mathbb{R}^d$ and suitable
constants $a$ and $b > 0$, the function $\Delta := -h$ satisfies the
requirement that $\psi + t\Delta \in \Phi$ whenever $0 < t < b/\|v\|$.
Hence the asserted inequality follows from (2). The equality for the first
moments follows by setting $h(x) := \pm v^\top x$ for arbitrary
$v \in \mathbb{R}^d$.

In what follows let

    $\mathcal{Q}^1 = \mathcal{Q}^1(d) := \{Q \in \mathcal{Q} : \int \|x\|\,Q(dx) < \infty\},$

    $\mathcal{Q}_o = \mathcal{Q}_o(d) := \{Q \in \mathcal{Q} : interior(csupp(Q)) \ne \emptyset\}.$

Thus $L(Q) \in \mathbb{R}$ if and only if
$Q \in \mathcal{Q}_o \cap \mathcal{Q}^1$. Moreover, the proof of
Theorem 2.2 shows that

    $L(Q) = -\infty$ for $Q \in \mathcal{Q} \setminus \mathcal{Q}^1$,
    $L(Q) = +\infty$ for $Q \in \mathcal{Q}^1 \setminus \mathcal{Q}_o$.


Remark 2.4 (Affine equivariance). Suppose that
$Q \in \mathcal{Q}_o \cap \mathcal{Q}^1$. For arbitrary vectors
$a \in \mathbb{R}^d$ and nonsingular, real $d \times d$ matrices $B$ define
$Q_{a,B}$ to be the distribution of $a + BX$ when $X$ has distribution $Q$.
Then $Q_{a,B} \in \mathcal{Q}_o \cap \mathcal{Q}^1$, too, and elementary
considerations reveal that

    $L(Q_{a,B}) = L(Q) - \log|\det B|$

and

    $\psi(x \mid Q_{a,B}) = \psi(B^{-1}(x - a) \mid Q) - \log|\det B|$ for $x \in \mathbb{R}^d$.

Remark 2.5 (Convexity, DSS 2010). The profile log-likelihood $L$ is convex
on $\mathcal{Q}^1$. Precisely, for arbitrary $Q_0, Q_1 \in \mathcal{Q}^1$
and $0 < t < 1$,

    $L((1 - t)Q_0 + tQ_1) \le (1 - t)L(Q_0) + tL(Q_1).$

The two sides are equal and real if and only if
$Q_0, Q_1 \in \mathcal{Q}_o \cap \mathcal{Q}^1$ with

    $\psi(\cdot \mid Q_0) = \psi(\cdot \mid Q_1).$

Remark 2.6 (Concave majorants, DSS 2010). Let $\psi = \psi(\cdot \mid Q)$
for a distribution $Q \in \mathcal{Q}_o \cap \mathcal{Q}^1$. For any open
set $U \subset \mathbb{R}^d$ there exists a (pointwise) minimal function
$\psi_U \in \Phi$ such that $\psi_U \ge \psi$ on
$\mathbb{R}^d \setminus U$. In particular, $\psi_U \le \psi$ with equality
on $\mathbb{R}^d \setminus U$. One can also show that $\psi_U$ is the
pointwise infimum of all affine functions $\varphi$ such that
$\varphi \ge \psi$ on $\mathbb{R}^d \setminus U$. If $supp(Q)$ denotes the
smallest closed set $A \subset \mathbb{R}^d$ with $Q(A) = 1$, then

    $\psi = \psi_{\mathbb{R}^d \setminus supp(Q)}.$

Furthermore, suppose that $Q$ has a density $g$ on an open set $U$ such
that $\psi > \log g$ on this set. Then

    $\psi = \psi_U.$

2.2. The one-dimensional case. For the case of $d = 1$ one can generalize
Theorem 2.4 of Dümbgen and Rufibach (2009) as follows: for a function
$\phi \in \Phi$ let

    $S(\phi) := \{x \in dom(\phi) : \phi(x) > 2^{-1}(\phi(x - \delta) + \phi(x + \delta))$ for all $\delta > 0\}.$

The log-concave approximation of a distribution on $\mathbb{R}$ can be
characterized in terms of distribution functions only:

Theorem 2.7. Let $Q$ be a nondegenerate distribution on $\mathbb{R}$ with
finite first moment and distribution function $G$. Let $F$ be a
distribution function with log-density $\phi \in \Phi$. Then
$\phi = \psi(\cdot \mid Q)$ if and only if

    $\int_{-\infty}^{\infty} (F(t) - G(t))\,dt = 0$

and

    $\int_{-\infty}^{x} (F(t) - G(t))\,dt \le 0$ for all $x \in \mathbb{R}$,
    with equality for all $x \in S(\phi)$.

Remark 2.8 (DSS 2010). One consequence of this theorem is that the
c.d.f. $F$ of $\psi(\cdot \mid Q)$ follows the c.d.f. $G$ of $Q$ quite
closely in that

    $G(x-) \le F(x) \le G(x)$ for arbitrary $x \in S(\psi(\cdot \mid Q))$.

Example 2.9. Let $Q$ be a rescaled version of Student's distribution $t_2$
with density and distribution function

    $g(x) = 2^{-1}(1 + x^2)^{-3/2}$ and $G(x) = 2^{-1}(1 + (1 + x^2)^{-1/2} x),$

respectively. The best approximating log-concave distribution is the
Laplace distribution with density and distribution function

    $f(x) = 2^{-1} e^{-|x|}$ and $F(x) = 2^{-1} e^{x}$ for $x \le 0$, $F(x) = 1 - 2^{-1} e^{-x}$ for $x \ge 0$,

respectively. To verify this claim, note that by symmetry it suffices to
show that

    $\int_{-\infty}^{x} (F(t) - G(t))\,dt \le 0$ for $x \le 0$, with equality for $x = 0$.

Indeed the integral on the left-hand side equals

    $2^{-1}(\exp(x) - x - (1 + x^2)^{1/2})$

for all $x \le 0$. Clearly this expression is zero for $x = 0$, and
elementary considerations show that it is nonpositive for all $x \le 0$.
Numerical calculations reveal that $|F - G|$ is smaller than 0.04
everywhere.
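
The two numerical statements of Example 2.9 are easy to reproduce. The
sketch below evaluates the two distribution functions on a grid (the grid
range and resolution are our own choice) and checks both the bound on
$|F - G|$ and the sign of the integrated difference for $x \le 0$.

```python
import numpy as np

def G(x):
    """C.d.f. of the rescaled t_2 distribution from Example 2.9."""
    return 0.5 * (1.0 + x / np.sqrt(1.0 + x**2))

def F(x):
    """C.d.f. of the standard Laplace distribution."""
    tail = 0.5 * np.exp(-np.abs(x))
    return np.where(x <= 0, tail, 1.0 - tail)

def integrated_difference(x):
    """Integral of F - G over (-inf, x], in closed form, valid for x <= 0."""
    return 0.5 * (np.exp(x) - x - np.sqrt(1.0 + x**2))

x = np.linspace(-50.0, 50.0, 200001)
max_gap = float(np.max(np.abs(F(x) - G(x))))        # sup of |F - G| on the grid

x_neg = x[x <= 0]
largest_integral = float(np.max(integrated_difference(x_neg)))  # should be <= 0
```

On this grid `max_gap` stays below 0.04 and `largest_integral` is zero up to
rounding, attained near $x = 0$.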

Remark 2.10. Let $Q \in \mathcal{Q}_o \cap \mathcal{Q}^1$ such that
$Q((a, b)) = 0$ for some bounded interval $(a, b) \subset csupp(Q)$. Then
$\psi = \psi(\cdot \mid Q)$ is linear on $[a, b]$. This follows from
Remark 2.6, applied to $U = (a, b)$. Note that $\psi(a) > -\infty$ and
$\psi(b) > -\infty$, because otherwise $\psi = -\infty$ on $(-\infty, a]$
or on $[b, \infty)$. But this would be incompatible with
$\int \psi\,dQ \in \mathbb{R}$, because both $Q((-\infty, a])$ and
$Q([b, \infty))$ are positive.

Remark 2.11 (DSS 2010). Suppose that $Q$ has a continuous but not
log-concave density $g$. Nevertheless one can say the following about the
approximating log-density $\psi = \psi(\cdot \mid Q)$:

(i) Suppose that $\log g$ is concave on an interval $(-\infty, a]$ with
$g(a) > 0$ and $\psi(a) < \log g(a)$. Then there exists a point
$a' \in [-\infty, a]$ such that $\psi$ is linear on $(a', a]$ and
$\psi = \log g$ on $(-\infty, a']$.

(ii) Suppose that $\log g$ is differentiable everywhere, convex on a
bounded interval $[a, b]$ and concave on both $(-\infty, a]$ and
$[b, \infty)$. Then there exist points $a' \in (-\infty, a]$ and
$b' \in [b, \infty)$ such that $\psi$ is linear on $[a', b']$ while
$\psi = \log g$ on $(-\infty, a'] \cup [b', \infty)$.

(iii) Suppose that $\log g$ is convex on an interval $(-\infty, a]$ such
that $-\infty < \log g(a) < \psi(a)$. Then $\psi$ is linear on
$(-\infty, a]$.

Example 2.12. Let us illustrate part (ii) of Remark 2.11 with a numerical
example. Figure 1 shows the bimodal density $g$ (green/dotted line) of the
Gaussian mixture
$Q = 0.7 \cdot \mathcal{N}(-1.5, 1) + 0.3 \cdot \mathcal{N}(1.5, 1)$
together with its log-concave approximation $f = f(\cdot \mid Q)$ (blue
line). As predicted, there exists an interval $[a', b']$ such that $f = g$
on $\mathbb{R} \setminus (a', b')$ and $\log f$ is linear on $[a', b']$.
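
Part (ii) of Remark 2.11 requires $\log g$ to be convex on a bounded
interval $[a, b]$ and concave outside of it. For the mixture of Example
2.12 this interval can be located numerically. The following sketch uses
central second differences on a grid of our own choosing; it only finds
$[a, b]$, not the larger interval $[a', b']$ on which the approximation is
log-linear.

```python
import numpy as np

def log_mixture_density(x, weights=(0.7, 0.3), means=(-1.5, 1.5)):
    """Log-density of a two-component Gaussian mixture with unit variances."""
    comps = [w * np.exp(-0.5 * (x - m) ** 2) / np.sqrt(2.0 * np.pi)
             for w, m in zip(weights, means)]
    return np.log(comps[0] + comps[1])

h = 1e-3
x = np.arange(-6.0, 6.0 + h, h)
log_g = log_mixture_density(x)
# Central second differences approximate (log g)'' at the interior grid points.
second_diff = (log_g[2:] - 2.0 * log_g[1:-1] + log_g[:-2]) / h**2
convex_points = x[1:-1][second_diff > 0]
a, b = float(convex_points.min()), float(convex_points.max())
```

For this mixture one has $(\log g)'' = -1 + 9p(1 - p)$ with
$p(x) = 1/(1 + (7/3)e^{-3x})$, so the grid search should return
approximately $a \approx -0.36$ and $b \approx 0.92$.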

Fig. 1. Density of a Gaussian mixture and its log-concave approximation.

2.3. Continuity in Q. For the applications to regression problems to
follow we need to understand the properties of both $Q \mapsto L(Q)$ and
$Q \mapsto \psi(\cdot \mid Q)$ on $\mathcal{Q}_o \cap \mathcal{Q}^1$. Our
first hope was that both mappings would be continuous with respect to the
weak topology. It turned out, however, that we need a somewhat stronger
notion of convergence, namely, convergence with respect to Mallows distance
$D_1$, which is defined as follows: for two probability distributions
$Q, Q' \in \mathcal{Q}^1$,

    $D_1(Q, Q') := \inf_{(X, X')} E\|X - X'\|,$

where the infimum is taken over all pairs $(X, X')$ of random vectors
$X \sim Q$ and $X' \sim Q'$ on a common probability space. It is well known
that the infimum in $D_1(Q, Q')$ is a minimum. The distance $D_1$ is also
known as Wasserstein, Monge-Kantorovich or Earth Mover's distance. An
alternative representation due to Kantorovic and Rubinstein (1958) is

    $D_1(Q, Q') = \sup_{h \in \mathcal{H}_L} \Bigl( \int h\,dQ - \int h\,dQ' \Bigr),$

where $\mathcal{H}_L$ consists of all $h : \mathbb{R}^d \to \mathbb{R}$
such that $|h(x) - h(y)| \le \|x - y\|$ for all $x, y \in \mathbb{R}^d$.
Moreover, for a sequence $(Q_n)_n$ in $\mathcal{Q}^1$, it is known that (1)
is equivalent to $D_1(Q_n, Q) \to 0$ as $n \to \infty$ [Mallows (1972),
Bickel and Freedman (1981)]. In case of $d = 1$, the optimal coupling of
$Q$ and $Q'$ is given by the quantile transformation: if $G$ and $G'$
denote the respective distribution functions, then

    $D_1(Q, Q') = \int_0^1 |G^{-1}(u) - G'^{-1}(u)|\,du = \int_{-\infty}^{\infty} |G(x) - G'(x)|\,dx.$



A good starting point for more detailed information on Mallows distance is 
Chapter 7 of Villani (2003). 
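
For two empirical distributions on the real line with equally many atoms,
both one-dimensional representations of $D_1$ can be evaluated exactly. The
sketch below is our own illustration (the function names are hypothetical):
the quantile coupling pairs the order statistics, and the second function
integrates the absolute difference of the two step-function c.d.f.s.

```python
import numpy as np

def mallows_quantile(x, y):
    """D_1 via the quantile coupling: mean absolute difference of the
    order statistics (both samples must have the same size)."""
    x, y = np.sort(x), np.sort(y)
    return float(np.mean(np.abs(x - y)))

def mallows_cdf(x, y):
    """D_1 as the integral of |G - G'| for the two empirical c.d.f.s."""
    xs, ys = np.sort(x), np.sort(y)
    pts = np.sort(np.concatenate([xs, ys]))
    # Both c.d.f.s are constant on [pts[k], pts[k+1]); evaluate at left ends.
    G = np.searchsorted(xs, pts[:-1], side="right") / len(xs)
    Gp = np.searchsorted(ys, pts[:-1], side="right") / len(ys)
    return float(np.sum(np.abs(G - Gp) * np.diff(pts)))

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = rng.exponential(size=200)
d_quantile = mallows_quantile(x, y)
d_cdf = mallows_cdf(x, y)
```

Both functions return the same value up to rounding, which illustrates the
identity between the quantile and the c.d.f. representation.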

Before presenting the main results of this section we mention two useful
facts about the convex support of distributions.

Lemma 2.13. Given a distribution $Q \in \mathcal{Q}$, a point
$x \in \mathbb{R}^d$ is an interior point of $csupp(Q)$ if and only if

    $h(Q, x) := \sup\{Q(C) : C \subset \mathbb{R}^d$ closed and convex, $x \notin interior(C)\} < 1.$

Moreover, if $(Q_n)_n$ is a sequence in $\mathcal{Q}$ converging weakly to
$Q$, then

    $\limsup_{n \to \infty} h(Q_n, x) \le h(Q, x)$ for any $x \in \mathbb{R}^d$.

This lemma implies that the set $\mathcal{Q}_o$ is an open subset of $\mathcal{Q}$ with respect
to the topology of weak convergence. The supremum h(Q,x) is a maximum 
over closed halfspaces and is related to Tukey's halfspace depth [Donoho 
and Gasko (1992), Section 6]. For a proof of Lemma 2.13 we refer to [DSS 
2010]. Now we are ready to state the main results of this section. 

Theorem 2.14 (Weak upper semicontinuity). Let $(Q_n)_n$ be a sequence of
distributions in $\mathcal{Q}_o$ converging weakly to some
$Q \in \mathcal{Q}_o$. Then

    $\limsup_{n \to \infty} L(Q_n) \le L(Q).$

Moreover, $\liminf_{n \to \infty} L(Q_n) < L(Q)$ if and only if

    $\limsup_{n \to \infty} \int \|x\|\,Q_n(dx) > \int \|x\|\,Q(dx).$

This result already entails continuity of $L(\cdot)$ on
$\mathcal{Q}_o \cap \mathcal{Q}^1$ with respect to Mallows distance $D_1$.
The next theorem extends this result to
$L : \mathcal{Q}^1 \to (-\infty, \infty]$:

Theorem 2.15 (Continuity with respect to Mallows distance $D_1$). Let
$(Q_n)_n$ be a sequence of distributions in $\mathcal{Q}^1$ such that
$\lim_{n \to \infty} D_1(Q_n, Q) = 0$ for some $Q \in \mathcal{Q}^1$. Then

    $\lim_{n \to \infty} L(Q_n) = L(Q).$

In case of $Q \in \mathcal{Q}_o \cap \mathcal{Q}^1$, the probability
densities $f := \exp \circ\, \psi(\cdot \mid Q)$ and
$f_n := \exp \circ\, \psi(\cdot \mid Q_n)$ are well defined for
sufficiently large $n$ and satisfy

    $\lim_{n \to \infty,\, x \to y} f_n(x) = f(y)$ for all $y \in \mathbb{R}^d \setminus \partial\{f > 0\}$,

    $\limsup_{n \to \infty,\, x \to y} f_n(x) \le f(y)$ for all $y \in \partial\{f > 0\}$,

    $\lim_{n \to \infty} \int |f_n(x) - f(x)|\,dx = 0.$

Remark 2.16 (Stronger modes of convergence). The convergence of $(f_n)_n$
to $f$ in total variation distance may be strengthened considerably. It
follows from recent results of Cule and Samworth (2010) or Schuhmacher,
Hüsler and Dümbgen (2009) that $(f_n)_n \to f$ uniformly on arbitrary
closed subsets of $\mathbb{R}^d \setminus disc(f)$, where $disc(f)$ is the
set of discontinuity points of $f$. The latter set is contained in the
boundary of the convex set $\{f > 0\}$, hence a null set with respect to
Lebesgue measure. Moreover, there exists a number $\epsilon(f) > 0$ such
that

    $\lim_{n \to \infty} \int e^{\epsilon(f)\|x\|} |f_n(x) - f(x)|\,dx = 0.$

More generally,

    $\lim_{n \to \infty} \int e^{A(x)} |f_n(x) - f(x)|\,dx = 0$

for any sublinear function $A : \mathbb{R}^d \to \mathbb{R}$ such that
$\lim_{\|x\| \to \infty} e^{A(x)} f(x) = 0$.

3. Applications to regression problems. Now we consider the regression
setting described in the Introduction with observations
$Y_i = \mu(x_i) + \epsilon_i$, $1 \le i \le n$, where the
$x_i \in \mathcal{X}$ are given fixed design points,
$\mu : \mathcal{X} \to \mathbb{R}$ is an unknown regression function, and
the $\epsilon_i$ are independent random errors with mean zero and unknown
distribution $Q$ on $\mathbb{R}$ such that $\psi = \psi(\cdot \mid Q)$ is
well defined. The regression function $\mu$ is assumed to belong to a given
family $\mathcal{M}$ with the property that

    $m + c \in \mathcal{M}$ for arbitrary $m \in \mathcal{M}$, $c \in \mathbb{R}$.

3.1. Maximum likelihood estimation. We propose to estimate $(\psi, \mu)$ by
a maximizer of

    $\Lambda(\phi, m) := \frac{1}{n} \sum_{i=1}^{n} \phi(Y_i - m(x_i)) - \int e^{\phi(x)}\,dx + 1$

over all $(\phi, m) \in \Phi \times \mathcal{M}$. Note that
$\Lambda(\phi, m)$ remains unchanged if we replace $(\phi, m)$ with
$(\phi(\cdot + c), m + c)$ for an arbitrary $c \in \mathbb{R}$. For fixed
$m$, the maximizer $\phi = \phi_m$ of $\Lambda(\cdot, m)$ over $\Phi$ will
automatically satisfy $\int \exp(\phi(x))\,dx = 1$ and
$\int x \exp(\phi(x))\,dx = n^{-1} \sum_{i=1}^{n} (Y_i - m(x_i))$. Thus if
$(\hat\phi, \hat m)$ maximizes $\Lambda(\cdot, \cdot)$ over
$\Phi \times \mathcal{M}$, then

    $(\hat\psi, \hat\mu) := (\hat\phi(\cdot + c), \hat m + c)$ with $c := \frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat m(x_i))$

maximizes $\Lambda(\phi, m)$ over all
$(\phi, m) \in \Phi \times \mathcal{M}$ satisfying the additional
constraint that $\exp \circ\, \phi$ defines a probability density with mean
zero.
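
The invariance of the criterion under the joint shift
$(\phi, m) \mapsto (\phi(\cdot + c), m + c)$ can be checked directly. The
following sketch evaluates the criterion for simulated data; the data, the
candidate pair $(\phi, m)$ and the grid used for the integral are all our
own assumptions, and no maximization is attempted.

```python
import numpy as np

def trapezoid(values, grid):
    """Trapezoidal rule on a (possibly non-uniform) grid."""
    return float(np.sum((values[1:] + values[:-1]) * np.diff(grid)) / 2.0)

def criterion(phi, m, x, y, grid):
    """(1/n) sum_i phi(Y_i - m(x_i)) - int exp(phi) + 1, integral on a grid."""
    residuals = y - m(x)
    return float(np.mean(phi(residuals))) - trapezoid(np.exp(phi(grid)), grid) + 1.0

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 3.0, size=50)
y = 1.0 + 2.0 * x + rng.normal(size=50)

grid = np.linspace(-60.0, 60.0, 120001)
phi = lambda t: -0.5 * t**2 - 0.5 * np.log(2.0 * np.pi)  # standard normal log-density
m = lambda t: 0.5 + 1.8 * t                              # some affine candidate

c = 0.7
phi_shift = lambda t: phi(t + c)   # phi(. + c)
m_shift = lambda t: m(t) + c       # m + c

value = criterion(phi, m, x, y, grid)
value_shifted = criterion(phi_shift, m_shift, x, y, grid)
```

The two values agree up to the (negligible) error of the numerical
integration, because the residuals move by $-c$ while the argument of
$\phi$ moves by $+c$.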
Define $x := (x_i)_{i=1}^{n}$ and $m(x) := (m(x_i))_{i=1}^{n}$. Then we may
write

    $\Lambda(\phi, m) = L(\phi, \hat Q_{m(x)})$

with the empirical distributions

    $\hat Q_v := \frac{1}{n} \sum_{i=1}^{n} \delta_{Y_i - v_i}$

for $v = (v_i)_{i=1}^{n} \in \mathbb{R}^n$. Thus our procedure aims to find

    $(\hat\phi, \hat m) \in \arg\max_{(\phi, m) \in \Phi \times \mathcal{M}} L(\phi, \hat Q_{m(x)}),$

and this representation is our key to proving the existence of
$(\hat\phi, \hat m)$. Before doing so we state a simple inequality of
independent interest, which follows from Jensen's inequality and elementary
considerations:

Lemma 3.1 (DSS 2010). For any distribution $Q \in \mathcal{Q}^1(1)$,

    $L(Q) \le -\log\Bigl( 2 \int |x - Med(Q)|\,Q(dx) \Bigr) \le -\log\Bigl( \int |x - \mu(Q)|\,Q(dx) \Bigr),$

where $Med(Q)$ is a median of $Q$ while $\mu(Q)$ denotes its mean
$\int x\,Q(dx)$.

Theorem 3.2 (Existence in regression). Suppose that the set
$\mathcal{M}(x) := \{m(x) : m \in \mathcal{M}\} \subset \mathbb{R}^n$ is
closed and does not contain $Y := (Y_i)_{i=1}^{n}$. Then there exists a
maximizer $(\hat\phi, \hat m)$ of $\Lambda(\phi, m)$ over all
$(\phi, m) \in \Phi \times \mathcal{M}$.

The constraint $Y \notin \mathcal{M}(x)$ excludes situations with perfect
fit. In that case, the Dirac measure $\delta_0$ would be the most plausible
error distribution.

Example 3.3 (Linear regression). Let $\mathcal{X} = \mathbb{R}^q$, and let
$\mathcal{M}$ consist of all affine functions on $\mathbb{R}^q$. Then
$\mathcal{M}(x)$ is the column space of the design matrix

    $X = \begin{bmatrix} 1 & 1 & \cdots & 1 \\ x_1 & x_2 & \cdots & x_n \end{bmatrix}^\top \in \mathbb{R}^{n \times (q+1)},$

hence a linear subspace of $\mathbb{R}^n$. Consequently there exists a
maximizer $(\hat\phi, \hat m)$ of $\Lambda$ over $\Phi \times \mathcal{M}$,
unless $Y \in \mathcal{M}(x)$.

Example 3.4 (Isotonic regression). Let $\mathcal{X}$ be some interval on
the real line, and let $\mathcal{M}$ consist of all isotonic functions
$m : \mathcal{X} \to \mathbb{R}$. Then the set $\mathcal{M}(x)$ is a closed
convex cone in $\mathbb{R}^n$. Here the condition that
$Y \notin \mathcal{M}(x)$ is equivalent to the existence of two indices
$i, j \in \{1, 2, \ldots, n\}$ such that $x_i \le x_j$ but $Y_i > Y_j$.
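
The perfect-fit condition $Y \in \mathcal{M}(x)$ of Examples 3.3 and 3.4
can be tested before any estimation is attempted. The two helper functions
below are hypothetical names of our own; the linear check uses a numerical
tolerance, which is an assumption and not part of the theory.

```python
import numpy as np

def perfect_linear_fit(x, y, tol=1e-10):
    """True if Y lies (numerically) in the column space of [1, x]."""
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones(len(y)), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return bool(np.linalg.norm(y - X @ coef) <= tol * max(1.0, np.linalg.norm(y)))

def perfect_isotonic_fit(x, y):
    """True if no pair (i, j) has x_i <= x_j but Y_i > Y_j."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    violates = (x[:, None] <= x[None, :]) & (y[:, None] > y[None, :])
    return not bool(violates.any())
```

If either function returns `True` for the model under consideration, the
existence result of Theorem 3.2 does not apply to that data set.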
Fisher consistency. The maximum likelihood estimator $(\hat\psi, \hat\mu)$
need not be unique in general. Nevertheless we will prove it to be
consistent under certain regularity conditions. A key point here is Fisher
consistency in the following sense: note that the expectation measure of
the empirical distribution $\hat Q_{m(x)}$ equals

    $E \hat Q_{m(x)} = \frac{1}{n} \sum_{i=1}^{n} Q \star \delta_{\mu(x_i) - m(x_i)} = Q \star R_{(\mu - m)(x)}$

with

    $R_v := \frac{1}{n} \sum_{i=1}^{n} \delta_{v_i}.$

But

    $L(Q \star R_{(\mu - m)(x)}) \le L(Q)$

with equality if and only if $\mu - m$ is constant on
$\{x_1, x_2, \ldots, x_n\}$. This follows from a more general inequality
which is somewhat reminiscent of Anderson's lemma [Anderson (1955)]:

Theorem 3.5. Let $Q \in \mathcal{Q}_o(d) \cap \mathcal{Q}^1(d)$ and
$R \in \mathcal{Q}^1(d)$. Then
$Q \star R \in \mathcal{Q}_o \cap \mathcal{Q}^1$ and

    $L(Q \star R) \le L(Q).$

Equality holds if and only if $R = \delta_a$ for some $a \in \mathbb{R}^d$.

3.2. Consistency. In this subsection we consider a triangular scheme of
independent observations $(x_{ni}, Y_{ni})$, $1 \le i \le n$, with fixed
design points $x_{ni} \in \mathcal{X}_n$ and

    $Y_{ni} = \mu_n(x_{ni}) + \epsilon_{ni},$

where $\mu_n$ is an unknown regression function in $\mathcal{M}_n$ and
$\epsilon_{n1}, \epsilon_{n2}, \ldots, \epsilon_{nn}$ are unobserved
independent random errors with mean zero and unknown distribution
$Q_n \in \mathcal{Q}_o(1) \cap \mathcal{Q}^1(1)$. Two basic assumptions
are:

(A.1) $\mathcal{M}_n(x_n)$ is a closed subset of $\mathbb{R}^n$ for every
$n \in \mathbb{N}$;

(A.2) $D_1(Q_n, Q) \to 0$ for some distribution
$Q \in \mathcal{Q}_o(1) \cap \mathcal{Q}^1(1)$.

We write $(\hat\psi_n, \hat\mu_n)$ for a maximizer of
$L(\phi, \hat Q_{n,m})$ over all pairs
$(\phi, m) \in \Phi \times \mathcal{M}_n$ such that
$\int e^{\phi(x)}\,dx = 1$ and $\int x e^{\phi(x)}\,dx = 0$, where
$\hat Q_{n,m}$ stands for the empirical distribution of the residuals
$Y_{ni} - m(x_{ni})$, $1 \le i \le n$. We also need to consider its
expectation measure

    $\bar Q_{n,m} := E \hat Q_{n,m} = Q_n \star R_{(\mu_n - m)(x_n)}.$

Furthermore we write

    $\|v\|_n := \frac{1}{n} \sum_{i=1}^{n} |v_i|$ for $v = (v_i)_{i=1}^{n} \in \mathbb{R}^n$.



It is also convenient to metrize weak convergence. In Theorem 3.6 below we
utilize the bounded Lipschitz distance: for probability distributions
$Q, Q'$ on the real line let

    $D_{BL}(Q, Q') := \sup_{h \in \mathcal{H}_{BL}} \Bigl| \int h\,d(Q - Q') \Bigr|,$

where $\mathcal{H}_{BL}$ is the family of all functions
$h : \mathbb{R} \to [-1, 1]$ such that $|h(x) - h(y)| \le |x - y|$ for all
$x, y \in \mathbb{R}$.

Theorem 3.6 (Consistency in regression). Let assumptions (A.1) and (A.2)
be satisfied. Suppose further that:

(A.3) for arbitrary fixed $c > 0$,

    $\sup_{m \in \mathcal{M}_n :\, \|(m - \mu_n)(x_n)\|_n \le c} D_{BL}(\hat Q_{n,m}, \bar Q_{n,m}) \to_p 0.$

Then, with $f_n := \exp \circ\, \psi(\cdot \mid Q_n)$ and
$\hat f_n := \exp \circ\, \hat\psi_n$, the maximum likelihood estimator
$(\hat f_n, \hat\mu_n)$ of $(f_n, \mu_n)$ exists with asymptotic
probability one and satisfies

    $\int |\hat f_n(x) - f_n(x)|\,dx \to_p 0, \qquad \|(\hat\mu_n - \mu_n)(x_n)\|_n \to_p 0.$



We know already that assumption (A.1) is satisfied for multiple linear
regression and isotonic regression. Assumption (A.2) is a generalization of
assuming a fixed error distribution for all sample sizes. The crucial
point, of course, is assumption (A.3). In our two examples it is satisfied
under mild conditions:

Theorem 3.7 (Linear regression). Let $\mathcal{M}_n$ be the family of all
affine functions on $\mathcal{X}_n := \mathbb{R}^{q(n)}$. If assumption
(A.2) is satisfied, then (A.3) follows from

    $\lim_{n \to \infty} q(n)/n = 0.$

Theorem 3.8 (Isotonic regression). Let $\mathcal{M}_n$ be the set of all
nondecreasing functions on an interval
$\mathcal{X}_n \subset \mathbb{R}$. If assumption (A.2) holds true, then
(A.3) follows from

    $\|\mu_n(x_n)\|_n = O(1).$

The proof of Theorem 3.7 is given in Section 4. For the proof of
Theorem 3.8, which uses similar ideas and an additional approximation
argument, we refer to [DSS 2010].

3.3. Algorithms and numerical results. Computing the maximum likelihood
estimator $(\hat\psi, \hat\mu)$ from Section 3.1 turns out to be a rather
difficult task, because the function $\Lambda$ can have multiple local
maxima. In [DSS 2010] we discuss strengths and weaknesses of three
different algorithms, including an alternating and a stochastic search
algorithm. The third procedure, which is highly successful in the case of
linear regression, is global maximization of the profile log-likelihood
$\Lambda(\theta) := \max_{\phi \in \Phi} \Lambda(\phi, m_\theta)$, where
$m_\theta(x) = \theta^\top x$ for every $\theta \in \mathbb{R}^q$, by means
of differential evolution [Price, Storn and Lampinen (2005)].

Extensive simulation studies in [DSS 2010] suggest that
$(\hat\psi, \hat\mu)$ provides rather accurate estimates even if $n$ is
only moderately large. For various skewed error distributions, $\hat\mu$
may be considerably better than the corresponding least squares estimator.
As an example consider the simple linear regression model with observations

    $Y_i = c + \theta X_i + \epsilon_i, \quad 1 \le i \le n := 100,$

where $X_1, \ldots, X_n$ are independent design points from the Unif[0, 3]
distribution and $\epsilon_1, \ldots, \epsilon_n$ are independent errors
from a centered gamma distribution with shape parameter $r$ and variance 1.
Note that the distribution of $(\hat\psi, \hat\theta - \theta)$ does not
depend on $c$ or $\theta$. Monte Carlo estimation of the root mean squared
error based on 1000 simulations of this model gives 0.023 for the estimator
$\hat\theta$ versus 0.118 for the least squares estimator of $\theta$ if
$r = 1$, and 0.095 versus 0.113 for the same comparison if $r = 3$.
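
The least squares half of this comparison is straightforward to re-create.
The sketch below simulates the stated model and estimates the root mean
squared error of the least squares slope only; it does not implement the
log-concave maximum likelihood estimator, and the seed, the values of $c$
and $\theta$ and the function name are our own choices.

```python
import numpy as np

def ls_slope_rmse(r, n=100, n_sim=1000, c=0.0, theta=1.0, seed=0):
    """Monte Carlo RMSE of the least squares slope with centered gamma errors."""
    rng = np.random.default_rng(seed)
    squared_errors = np.empty(n_sim)
    for s in range(n_sim):
        x = rng.uniform(0.0, 3.0, size=n)
        # Gamma(shape r, scale r**-0.5) has mean sqrt(r) and variance 1.
        eps = rng.gamma(shape=r, scale=r ** -0.5, size=n) - np.sqrt(r)
        y = c + theta * x + eps
        slope = np.polyfit(x, y, 1)[0]   # polyfit returns [slope, intercept]
        squared_errors[s] = (slope - theta) ** 2
    return float(np.sqrt(squared_errors.mean()))

rmse_r1 = ls_slope_rmse(1.0)
rmse_r3 = ls_slope_rmse(3.0)
```

Since the error variance equals 1 for every $r$, the least squares RMSE
should be roughly $(n \cdot 0.75)^{-1/2} \approx 0.12$ in both cases, which
is in line with the values 0.118 and 0.113 reported above.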

3.4. A data example. A familiar task in econometrics is to model expen- 
diture (Y) of households as a function of their income (X). Not only the 
mean curve (Engel curve) but also quantile curves play an important role. 
A related application is growth charts in which, for instance, $X$ is the
age of a newborn or infant and $Y$ is its height or weight.

We applied our methods to a survey of $n = 7125$ households in the United
Kingdom in 1973 (data courtesy of W. Härdle, HU Berlin). The two variables
we considered were annual income ($X_{raw}$) and annual expenditure for
food ($Y_{raw}$). Figure 2 shows scatter plots of the raw and
log-transformed data. To enhance visibility we only show a random subsample
of size $n' = 1000$. In addition, isotonic quantile curves
$x \mapsto q_\beta(x)$ are added for $\beta = 0.1, 0.25, 0.5, 0.75, 0.9$
(based on all observations). These pictures show clearly that the raw data
are heteroscedastic, whereas for the log-transformed data,
$(X_i, Y_i) = (\log_{10} X_{raw,i}, \log_{10} Y_{raw,i})$, an additive
model seems appropriate.

Fig. 2. UK household data, raw (left) and log-transformed (right), with
isotonic quantile curves.

Interestingly, neither linear nor quadratic nor cubic regression yields
convincing fits to these data. Polynomial regression of degree four or
cubic splines with knot points at, say, 4.1, 4.3, 4.5, 4.7, 4.9 seem to fit
the data quite well. Moreover, an exact Monte Carlo goodness-of-fit test,
assuming the regression function to be a cubic spline and based on a
Kolmogorov-Smirnov statistic applied to studentized residuals, revealed the
regression errors $\epsilon_i$ to be definitely non-Gaussian.

Figure 3 shows the data and estimated $\beta$-quantile curves for
$\beta = 0.1, 0.25, 0.5, 0.75, 0.9$, based on our additive regression
model. Note that the estimated $\beta$-quantile curve is simply the
estimated mean curve plus the $\beta$-quantile of the estimated error
distribution. On the left-hand side, we only assumed $\mu$ to be
nondecreasing, on the right-hand side we fitted the aforementioned spline
model. In both cases the fitted quantile curves are similar to the quantile
curves in Figure 2 but with fewer irregularities such as big jumps which
may be artifacts due to sampling error.

Fig. 3. Log-transformed UK household data with isotonic fits (left) and
spline fits (right) from our additive model.

LOG-CONCAVE APPROXIMATIONS

4. Proofs. For the proof of Theorem 2.2 we need an elementary bound for
the Lebesgue measure of level sets of log-concave distributions:

Lemma 4.1 (DSS 2010). Let $\phi \in \Phi$ be such that
$\int e^{\phi(x)}\,dx = 1$. For real $t$ define the level set
$D_t := \{x \in \mathbb{R}^d : \phi(x) \ge t\}$. Then for
$r < M \le \max_{x \in \mathbb{R}^d} \phi(x)$,

    $Leb(D_r) \le (M - r)^d e^{-M} \Big/ \int_0^{M - r} t^d e^{-t}\,dt.$
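
As a sanity check of this bound, consider $d = 1$ and the standard normal
log-density, for which the level sets are intervals of known length. The
sketch below is our own example and takes $M$ equal to the maximum of
$\phi$; it compares both sides of the inequality for several values of
$s = M - r$.

```python
import math

def level_set_bound_d1(s, log_max):
    """Right-hand side of Lemma 4.1 for d = 1, M = max(phi) and s = M - r > 0.

    Uses the closed form int_0^s t e^{-t} dt = 1 - (1 + s) e^{-s}.
    """
    return s * math.exp(-log_max) / (1.0 - (1.0 + s) * math.exp(-s))

def level_set_length_normal(s):
    """Length of {phi >= M - s} for the standard normal log-density phi,
    that is, of the interval |x| <= sqrt(2 s)."""
    return 2.0 * math.sqrt(2.0 * s)

M = -0.5 * math.log(2.0 * math.pi)   # maximum of the standard normal log-density
checks = [(s, level_set_length_normal(s), level_set_bound_d1(s, M))
          for s in (0.01, 0.1, 1.0, 10.0, 100.0)]
```

Each triple in `checks` contains $s$, the exact length of the level set and
the bound; the length never exceeds the bound in these cases.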

Another key ingredient for the proofs of Theorems 2.2 and 2.15 is a lemma
on pointwise limits of sequences in $\Phi$:

Lemma 4.2 (DSS 2010). Let $\bar\phi$ and $\phi_1, \phi_2, \phi_3, \ldots$
be functions in $\Phi$ such that $\phi_n \le \bar\phi$ for all
$n \in \mathbb{N}$. Further suppose that the set

    $C := \{x \in \mathbb{R}^d : \liminf_{n \to \infty} \phi_n(x) > -\infty\}$

is nonempty. Then there exist a subsequence $(\phi_{n(k)})_k$ of
$(\phi_n)_n$ and a function $\phi \in \Phi$ such that
$C \subset dom(\phi) = \{\phi > -\infty\}$ and

    $\lim_{k \to \infty,\, x \to y} \phi_{n(k)}(x) = \phi(y)$ for all $y \in interior(dom(\phi))$,

    $\limsup_{k \to \infty,\, x \to y} \phi_{n(k)}(x) \le \phi(y) \le \bar\phi(y)$ for all $y \in \mathbb{R}^d$.

Proof of Theorem 2.2. Suppose first that $\int \|x\|\,Q(dx) = \infty$.
Since any $\phi \in \Phi$ is majorized by $x \mapsto a - b\|x\|$ for
suitable constants $a$ and $b > 0$, this entails that $L(Q) = -\infty$.

Second, suppose that $\int \|x\|\,Q(dx) < \infty$ but
$interior(csupp(Q)) = \emptyset$. According to Lemma 2.1, the latter fact
is equivalent to $Q(H) = 1$ for some hyperplane
$H \subset \mathbb{R}^d$. For $c \in \mathbb{R}$ define a function
$\phi_c \in \Phi$ via $\phi_c(x) := c - \|x\|$ for $x \in H$ and
$\phi_c(x) := -\infty$ for $x \notin H$. Then
$L(\phi_c, Q) = c - \int \|x\|\,Q(dx) + 1 \to \infty$ as $c \to \infty$.

For the remainder of this proof suppose that
$\int \|x\|\,Q(dx) < \infty$ and that $csupp(Q)$ has nonempty interior.
Since the concave function $h(x) = -\|x\|$ satisfies
$\int h\,dQ > -\infty$, we have $L(Q) > -\infty$. When maximizing
$L(\phi, Q)$ over all $\phi \in \Phi$ we may and do restrict our attention
to functions $\phi \in \Phi$ such that $\int e^{\phi(x)}\,dx = 1$ (see end
of Section 1) and $dom(\phi) = \{\phi > -\infty\} \subset csupp(Q)$. For if
$dom(\phi) \not\subset csupp(Q)$, replacing $\phi(x)$ with $-\infty$ for
all $x \notin csupp(Q)$ would also increase $L(\phi, Q)$ strictly. Let
$\Phi(Q)$ be the family of all $\phi \in \Phi$ with these properties.

Now we show that $L(Q) < \infty$. Suppose that $\phi \in \Phi(Q)$ is such that $M := \max_{x \in \mathbb{R}^d} \phi(x) > 0$. With $D_t := \{\phi \ge t\}$ and for $c > 0$ we get the bound

$$L(\phi,Q) = \int \phi\,dQ \le -cM\,Q(\mathbb{R}^d \setminus D_{-cM}) + M\,Q(D_{-cM}) = -(c+1)M \Bigl[\frac{c}{c+1} - Q(D_{-cM})\Bigr].$$

According to Lemma 4.1,

$$\mathrm{Leb}(D_{-cM}) \le (1+c)^d M^d e^{-M} \Big/ \int_0^{(1+c)M} t^d e^{-t}\,dt = (1+c)^d M^d e^{-M} / (d! + o(1)) \to 0$$

as $M \to \infty$ for any fixed $c > 0$. But Lemma 2.1 entails that for sufficiently large $c$ and sufficiently small $\delta > 0$,

$$\sup\{Q(C) : C \subset \mathbb{R}^d \text{ closed and convex},\ \mathrm{Leb}(C) \le \delta\} < \frac{c}{c+1},$$

whence

$$L(\phi,Q) \to -\infty \quad\text{as } \max_{x} \phi(x) \to \infty.$$

Note also that $L(\phi,Q) \le \max_{x \in \mathbb{R}^d} \phi(x)$ for any $\phi \in \Phi(Q)$. These considerations show that $L(Q)$ is finite and, for suitable constants $M_o < M_*$, equals the supremum of $L(\phi,Q)$ over all $\phi \in \Phi(Q)$ such that $M_o \le \max_x \phi(x) \le M_*$.

Next we show the existence of a maximizer $\psi \in \Phi(Q)$ of $L(\cdot,Q)$. Let $(\phi_n)_n$ be a sequence of functions in $\Phi(Q)$ such that $-\infty < L(\phi_n,Q) \uparrow L(Q)$ as $n \to \infty$, where $M_n := \max_{x \in \mathbb{R}^d} \phi_n(x) \in [M_o, M_*]$ for all $n \ge 1$. Now we show that

(5) $\inf_{n \ge 1} \phi_n(x_o) > -\infty$ for any $x_o \in \mathrm{interior}(\mathrm{csupp}(Q))$.

If $\phi_n(x_o) < M_n$, then $x_o$ is not an interior point of the closed, convex set $\{\phi_n \ge \phi_n(x_o)\}$. Hence

$$\int \phi_n\,dQ \le \phi_n(x_o) + (M_n - \phi_n(x_o))\,Q\{\phi_n \ge \phi_n(x_o)\}$$

$$\le \phi_n(x_o) + (M_n - \phi_n(x_o))\,h(Q,x_o)$$

$$\le \phi_n(x_o)(1 - h(Q,x_o)) + \max(M_n,0)$$

with $h(Q,x_o) < 1$ defined in Lemma 2.13. In the case of $\phi_n(x_o) = M_n$ these inequalities are true as well. Thus

$$\phi_n(x_o) \ge -\frac{\max(M_n,0) - L(\phi_n,Q)}{1 - h(Q,x_o)} \ge -\frac{\max(M_*,0) - L(\phi_1,Q)}{1 - h(Q,x_o)},$$






which establishes (5). Combining (5) with $\phi_n \le M_*$, we may deduce from Lemma 3.3 of Schuhmacher, Hüsler and Dümbgen (2009) that there exist constants $a$ and $b > 0$ such that

(6) $\phi_n(x) \le a - b\|x\|$ for all $n \in \mathbb{N}$, $x \in \mathbb{R}^d$.

The inequalities (5) and (6) and Lemma 4.2 with $C \supset \mathrm{interior}(\mathrm{csupp}(Q))$ and $\bar\phi(x) := a - b\|x\|$ imply existence of a function $\psi \in \Phi$ and a subsequence $(\phi_{n(k)})_k$ of $(\phi_n)_n$ such that $\psi \equiv -\infty$ on $\mathbb{R}^d \setminus \mathrm{csupp}(Q)$ and

$$\limsup_{k\to\infty} \phi_{n(k)}(x) \le \psi(x) \le a - b\|x\| \quad\text{for all } x \in \mathbb{R}^d,$$

$$\lim_{k\to\infty} \phi_{n(k)}(x) = \psi(x) > -\infty \quad\text{for all } x \in \mathrm{interior}(\mathrm{csupp}(Q)).$$

Since the boundary of $\mathrm{csupp}(Q)$ has Lebesgue measure zero, it follows from dominated convergence that $\int e^{\psi(x)}\,dx = 1$. Moreover, applying Fatou's lemma to the nonnegative functions $x \mapsto a - b\|x\| - \phi_{n(k)}(x)$ yields

$$\limsup_{k\to\infty} \int \phi_{n(k)}\,dQ \le \int \psi\,dQ.$$

Hence

$$L(Q) \ge L(\psi,Q) \ge \limsup_{k\to\infty} L(\phi_{n(k)},Q) = L(Q)$$

and thus $L(\psi,Q) = L(Q)$.

Uniqueness of the maximizer $\psi$ follows essentially from strict convexity of the exponential function: if $\tilde\psi \in \Phi(Q)$ with $L(\tilde\psi,Q) > -\infty$, then $L((1-t)\psi + t\tilde\psi, Q)$ is strictly concave in $t \in [0,1]$, unless $\mathrm{Leb}\{\tilde\psi \ne \psi\} = 0$. But for $\psi, \tilde\psi \in \Phi(Q)$, the latter requirement is equivalent to $\tilde\psi = \psi$ everywhere. □

In our proofs of Theorems 2.7 and 2.15 we utilize a special approximation scheme for functions in $\Phi$:

Lemma 4.3 (DSS 2010). For any function $\phi \in \Phi$ with nonempty domain and any parameter $\varepsilon > 0$ set

$$\phi^{(\varepsilon)}(x) := \inf_{(v,c)} (v^\top x + c)$$

with the infimum taken over all $(v,c) \in \mathbb{R}^d \times \mathbb{R}$ such that $\|v\| \le \varepsilon^{-1}$ and $\phi(y) \le v^\top y + c$ for all $y \in \mathbb{R}^d$. This defines a function $\phi^{(\varepsilon)} \in \Phi$ which is real-valued and Lipschitz-continuous with constant $\varepsilon^{-1}$. Moreover, it satisfies $\phi^{(\varepsilon)} \ge \phi$ with equality if and only if $\phi$ is real-valued and Lipschitz-continuous with constant $\varepsilon^{-1}$. In general, $\phi^{(\varepsilon)} \downarrow \phi$ pointwise as $\varepsilon \downarrow 0$.
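The approximation in Lemma 4.3 can be explored numerically in dimension one. The sketch below relies on an assumption that is not stated in the text: for concave $\phi$, the infimum over affine majorants with slope at most $\varepsilon^{-1}$ coincides with the sup-convolution $\sup_y\,[\phi(y) - |x-y|/\varepsilon]$. The example function $\phi(x) = -x^2/2$, the grid and all names are hypothetical choices for illustration.

```python
def lipschitz_envelope(phi, eps, grid):
    """Approximate phi^(eps) on `grid` by sup over grid points y of phi(y) - |x - y| / eps."""
    slope = 1.0 / eps
    return [max(phi(y) - slope * abs(x - y) for y in grid) for x in grid]

def phi(x):
    """Concave example function (Gaussian log-density up to an additive constant)."""
    return -0.5 * x * x

def closed_form(x, eps):
    """Envelope of -x^2/2: unchanged where |x| <= 1/eps, linear with slope 1/eps beyond."""
    slope = 1.0 / eps
    if abs(x) <= slope:
        return -0.5 * x * x
    return 0.5 * slope * slope - slope * abs(x)

grid = [i / 20.0 for i in range(-120, 121)]  # points in [-6, 6] with step 0.05
env_one = lipschitz_envelope(phi, 1.0, grid)
env_half = lipschitz_envelope(phi, 0.5, grid)

# The envelope majorizes phi and shrinks toward phi as eps decreases.
print(all(e >= phi(x) for e, x in zip(env_one, grid)))
print(all(eh <= e1 + 1e-12 for eh, e1 in zip(env_half, env_one)))
```

For this example the envelope agrees with $\phi$ on $[-\varepsilon^{-1}, \varepsilon^{-1}]$ and continues linearly outside, which matches the monotone convergence $\phi^{(\varepsilon)} \downarrow \phi$ stated in the lemma.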
Proof of Theorem 2.7. Let $P$ be the distribution corresponding to $F$. Suppose first that $\phi = \psi(\cdot|Q)$. Then it follows from (4) and Fubini's theorem that

$$0 = \int_{\mathbb{R}} x\,(Q-P)(dx) = \int_{\mathbb{R}} \int_{\mathbb{R}} (1\{0 \le t < x\} - 1\{x \le t < 0\})\,dt\,(Q-P)(dx)$$

$$= \int_{\mathbb{R}} (1\{0 \le t\}(F-G)(t) - 1\{t < 0\}(G-F)(t))\,dt = \int_{\mathbb{R}} (F-G)(t)\,dt.$$

Moreover, for any $x \in \mathbb{R}$, the function $s \mapsto (s-x)^+$ is convex so that (3) and Fubini's theorem yield

$$0 \le \int_{\mathbb{R}} (s-x)^+\,(Q-P)(ds) = -\int_{-\infty}^x (F-G)(t)\,dt.$$



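It remains to be shown that $\int_{-\infty}^x (F-G)(t)\,dt \ge 0$ for $x \in S(\phi)$. Suppose first that $x \in \mathrm{interior}(\mathrm{dom}(\phi))$. Note that $\phi' := \phi'(\cdot\,+)$ is nonincreasing on the interior of $\mathrm{dom}(\phi)$ with

$$\phi(x_2) - \phi(x_1) = \int_{x_1}^{x_2} \phi'(u)\,du \quad\text{for } x_1, x_2 \in \mathrm{interior}(\mathrm{dom}(\phi)) \text{ with } x_1 < x_2.$$

Moreover, $x \in S(\phi)$ implies that $\phi'(x-\delta) > \phi'(x+\delta)$ for all $\delta > 0$ satisfying $x \pm \delta \in \mathrm{interior}(\mathrm{dom}(\phi))$. For such $\delta > 0$ we define

$$H_\delta(s) := \int_{-\infty}^s H_\delta'(u)\,du$$

with

$$H_\delta'(u) := \begin{cases} 0, & \text{for } u < x - \delta,\\[4pt] \dfrac{\phi'(x-\delta) - \phi'(u)}{\phi'(x-\delta) - \phi'(x+\delta)}, & \text{for } x - \delta \le u < x + \delta,\\[4pt] 1, & \text{for } u \ge x + \delta. \end{cases}$$

One can easily verify that $\phi + tH_\delta$ is upper semicontinuous and concave whenever $0 < t \le \phi'(x-\delta) - \phi'(x+\delta)$. In case of $t < -\inf_u \phi'(u)$ it is also coercive. Thus it follows from (2) that

$$0 \le \int_{\mathbb{R}} H_\delta(s)\,(P-Q)(ds) \to \int_{\mathbb{R}} (s-x)^+\,(P-Q)(ds) \quad (\delta \downarrow 0)$$

$$= \int_{-\infty}^x (F-G)(t)\,dt.$$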
When $x \in S(\phi)$ is the left or right endpoint of $\mathrm{dom}(\phi)$, we define $\Delta(s) := (s-x)^+$ and conclude analogously that $\int_{-\infty}^x (F-G)(t)\,dt \ge 0$.

Now suppose that the distribution function $F$ with log-density $\phi \in \Phi$ satisfies the integral (in)equalities stated in Theorem 2.7. Let $\Delta : \mathbb{R} \to \mathbb{R}$ be Lipschitz-continuous with constant $L$, so for arbitrary $x, y \in \mathbb{R}$ with $x < y$,

$$\Delta(y) - \Delta(x) = \int_x^y \Delta'(t)\,dt$$

with $\Delta' : \mathbb{R} \to [-L, L]$ measurable. Then

$$\int \Delta\,d(Q-P) = \int_{\mathbb{R}} \Delta'(t)(F-G)(t)\,dt.$$

Since $\int (F-G)(t)\,dt = 0$, we may continue with

$$\int \Delta\,d(Q-P) = \int_{\mathbb{R}} (\Delta'(t) + L)(F-G)(t)\,dt$$

$$= \int_{\mathbb{R}} \int_{-L}^{L} 1\{s < \Delta'(t)\}\,ds\,(F-G)(t)\,dt$$

$$= \int_{-L}^{L} \int_{A(\Delta',s)} (F-G)(t)\,dt\,ds$$

with $A(\Delta',s) := \{t \in \mathbb{R} : \Delta'(t) > s\}$. Now we apply this representation to the function $\Delta := \phi^{(\varepsilon)}$ for some $\varepsilon > 0$, that is, $L = \varepsilon^{-1}$. Here one can show that $A(\Delta',s)$ equals either $\emptyset$ or $\mathbb{R}$ or a half-line with right endpoint $a(\phi,s) = \min\{t \in \mathbb{R} : \phi'(t+) \le s\}$. But this entails that $a(\phi,s) \in S(\phi)$, whence $\int_{A(\Delta',s)} (F-G)(t)\,dt = 0$ for all $s \in (-L,L)$. Consequently,

$$\int \phi^{(\varepsilon)}\,d(Q-P) = 0.$$

If we consider $\Delta := \psi^{(\varepsilon)}$ with $\psi := \psi(\cdot|Q)$, the sets $A(\Delta',s)$ are still half-lines with right endpoint or empty or equal to $\mathbb{R}$. Thus $\int_{A(\Delta',s)} (F-G)(t)\,dt \le 0$ for all $s \in (-L,L)$, whence

$$\int \psi^{(\varepsilon)}\,d(Q-P) \le 0.$$

Since $\phi^{(\varepsilon)} \downarrow \phi$ and $\psi^{(\varepsilon)} \downarrow \psi$ as $\varepsilon \downarrow 0$, and since $\int \phi\,dP$ and $\int \psi\,dQ$ exist in $\mathbb{R}$, we can deduce from monotone convergence that $\int \phi\,d(Q-P) = 0 \ge \int \psi\,d(Q-P)$. Since $\int e^{\phi(x)}\,dx = \int e^{\psi(x)}\,dx = 1$, this entails that

$$L(\phi,Q) = L(\phi,P) \ge L(\psi,P) \ge L(\psi,Q),$$

where the first displayed inequality follows from log-concavity of $P$ with log-density $\phi$. Thus $\phi = \psi$. □
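The integral conditions used in this proof can be checked on a toy example. The sketch below assumes that Theorem 2.7 characterizes $\phi = \psi(\cdot|Q)$ by $\int_{-\infty}^x (F-G)(t)\,dt \le 0$ for all $x$, with equality on $S(\phi)$, as the proof above suggests. It takes $Q = (\delta_{-1} + \delta_1)/2$ and, as a candidate approximation, the uniform distribution on $[-1,1]$; this pairing, the midpoint rule and all function names are illustrative assumptions.

```python
def F(t):
    """Distribution function of Uniform[-1, 1], whose log-density is constant on [-1, 1]."""
    return min(1.0, max(0.0, 0.5 * (t + 1.0)))

def G(t):
    """Distribution function of Q = (delta_{-1} + delta_{1}) / 2."""
    if t < -1.0:
        return 0.0
    return 0.5 if t < 1.0 else 1.0

def integrated_difference(x, lower=-3.0, cells=6000):
    """Midpoint-rule approximation of int_{lower}^{x} (F - G)(t) dt.

    F - G vanishes below -1, so any lower limit below -1 gives the same value.
    """
    if x <= lower:
        return 0.0
    h = (x - lower) / cells
    return h * sum(F(lower + (i + 0.5) * h) - G(lower + (i + 0.5) * h) for i in range(cells))

def exact(x):
    """Closed form of the same integral: (x^2 - 1) / 4 on [-1, 1], zero outside."""
    return 0.0 if abs(x) >= 1.0 else (x * x - 1.0) / 4.0

for x in (-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0):
    print(x, round(integrated_difference(x), 4), exact(x))
```

The integrated difference is nonpositive everywhere and vanishes at the endpoints $\pm 1$ of the domain and at $+\infty$, so under the stated reading of Theorem 2.7 the uniform density would be the log-concave approximation of this two-point distribution.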
Theorem 2.14 and the second part of Theorem 2.15 are a consequence of the following result:

Theorem 4.4. Let $(Q_n)_n$ be a sequence of distributions in $\mathcal{Q}$ such that $Q_n \to_w Q \in \mathcal{Q}_o$, $L(Q_n) \to \lambda \in [-\infty,\infty]$ and $\int \|x\|\,Q_n(dx) \to \gamma \in [0,\infty]$ as $n \to \infty$. Then $\gamma \ge \int \|x\|\,Q(dx)$, and $\lambda > -\infty$ if and only if $\gamma < \infty$. Moreover,

$$\lambda \begin{cases} < L(Q), & \text{if } \gamma > \int \|x\|\,Q(dx),\\[4pt] = L(Q) \in \mathbb{R}, & \text{if } \gamma = \int \|x\|\,Q(dx) < \infty. \end{cases}$$

In the latter case, the densities $f := \exp \circ\, \psi(\cdot|Q)$ and $f_n := \exp \circ\, \psi(\cdot|Q_n)$ are well defined for sufficiently large $n$ and satisfy

$$\lim_{n\to\infty,\,x\to y} f_n(x) = f(y) \quad\text{for all } y \in \mathbb{R}^d \setminus \partial\{f > 0\},$$

$$\limsup_{n\to\infty,\,x\to y} f_n(x) \le f(y) \quad\text{for all } y \in \partial\{f > 0\},$$

$$\lim_{n\to\infty} \int |f_n(x) - f(x)|\,dx = 0.$$

Before presenting the proof of this result, let us recall two elementary facts about weak convergence and unbounded functions:

Lemma 4.5. Suppose that $(Q_n)_n$ is a sequence in $\mathcal{Q}$ converging weakly to some distribution $Q$. If $h$ is a nonnegative and continuous function on $\mathbb{R}^d$, then

$$\liminf_{n\to\infty} \int h\,dQ_n \ge \int h\,dQ.$$

If the stronger statement $\lim_{n\to\infty} \int h\,dQ_n = \int h\,dQ < \infty$ holds, then

$$\lim_{n\to\infty} \int f\,dQ_n = \int f\,dQ$$

for any continuous function $f$ on $\mathbb{R}^d$ such that $|f|/(1+h)$ is bounded.

Proof of Theorem 4.4. The asserted inequality $\gamma \ge \int \|x\|\,Q(dx)$ follows from the first part of Lemma 4.5 with $h(x) := \|x\|$.

Suppose that $\gamma < \infty$. Then with $\phi(x) := -\|x\|$,

$$\lambda \ge \lim_{n\to\infty} L(\phi,Q_n) = -\gamma - \int e^{-\|x\|}\,dx + 1 > -\infty.$$

In other words, $\lambda = -\infty$ entails that $\gamma = \infty$.
From now on suppose that $\lambda > -\infty$, and without loss of generality let $L(Q_n) > -\infty$ for all $n \in \mathbb{N}$. We have to show that $\gamma < \infty$ and that $\lambda \le L(Q)$ with equality if and only if $\gamma = \int \|x\|\,Q(dx)$. To this end we analyze the functions $\psi_n := \psi(\cdot|Q_n)$ and their maxima $M_n := \max_{x \in \mathbb{R}^d} \psi_n(x)$. First of all,

(7) $(M_n)_n$ is bounded.

This can be verified as follows: since $L(Q_n) = \int \psi_n\,dQ_n \le M_n$, the sequence $(M_n)_n$ satisfies $\liminf_{n\to\infty} M_n \ge \lambda$. With similar arguments as in the proof of Theorem 2.2 one can deduce that $(M_n)_n$ is bounded from above, provided that

$$\limsup_{n\to\infty} Q_n(C_n) < 1$$

for any sequence of closed and convex sets $C_n \subset \mathbb{R}^d$ with $\lim_{n\to\infty} \mathrm{Leb}(C_n) = 0$. To this end we refer to the proof of Lemma 2.1 in [DSS 2010]: there exist a simplex $\Delta = \mathrm{conv}(x_0, \ldots, x_d)$ with positive Lebesgue measure and open sets $U_0, U_1, \ldots, U_d$ with $Q(U_j) \ge \eta > 0$ for $0 \le j \le d$, such that $\Delta \subset C$ for any convex set $C$ with $C \cap U_j \ne \emptyset$ for $0 \le j \le d$. But $\liminf_{n\to\infty} Q_n(U_j) \ge Q(U_j) \ge \eta$ for all $j$. Hence $\mathrm{Leb}(C_n) < \mathrm{Leb}(\Delta)$ entails that $Q_n(C_n) \le 1 - \min_{0 \le j \le d} Q_n(U_j) \le 1 - \eta + o(1)$ as $n \to \infty$.

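Another key property of the functions $\psi_n$ is that

(8) $\liminf_{n\to\infty} \psi_n(x_o) > -\infty$ for any $x_o \in \mathrm{interior}(\mathrm{csupp}(Q))$.

For, as in the proof of Theorem 2.2,

$$L(Q_n) = \int \psi_n\,dQ_n \le \psi_n(x_o)(1 - h(Q_n,x_o)) + \max(M_n,0),$$

whence as $n \to \infty$,

$$\psi_n(x_o) \ge -\frac{\max(M_n,0) - L(Q_n)}{1 - h(Q_n,x_o)} \ge -\frac{\limsup_{\ell\to\infty} \max(M_\ell,0) - \lambda}{1 - h(Q,x_o)} + o(1)$$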

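by virtue of Lemma 2.13. Combining (8) with (7) we may again deduce that there exist constants $a$ and $b > 0$ such that

(9) $\psi_n(x) \le a - b\|x\|$ for all $n \in \mathbb{N}$, $x \in \mathbb{R}^d$.

As in the proof of Theorem 2.2 we can replace $(Q_n)_n$ with a subsequence such that for suitable constants $a$, $b > 0$ and a function $\tilde\psi \in \Phi$ the following conditions are met: $\mathrm{interior}(\mathrm{csupp}(Q)) \subset \mathrm{dom}(\tilde\psi)$ and

$$\psi_n(y),\ \tilde\psi(y) \le a - b\|y\| \quad\text{for all } y \in \mathbb{R}^d,\ n \in \mathbb{N},$$

$$\lim_{n\to\infty,\,x\to y} \psi_n(x) = \tilde\psi(y) \quad\text{for all } y \in \mathrm{interior}(\mathrm{dom}(\tilde\psi)),$$

$$\limsup_{n\to\infty,\,x\to y} \psi_n(x) \le \tilde\psi(y) \quad\text{for all } y \in \mathbb{R}^d.$$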
In particular,

$$\lambda = \lim_{n\to\infty} \int \psi_n\,dQ_n \le \lim_{n\to\infty} \int (a - b\|x\|)\,Q_n(dx) = a - b\gamma,$$

whence $\gamma < \infty$. Moreover, $\int \exp(\tilde\psi(x))\,dx = \lim_{n\to\infty} \int \exp(\psi_n(x))\,dx = 1$, by dominated convergence.

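By Skorohod's theorem, there exists a probability space $(\Omega, \mathcal{A}, \mathbb{P})$ with random variables $X_n \sim Q_n$ and $X \sim Q$ such that $\lim_{n\to\infty} X_n = X$ almost surely. Hence Fatou's lemma, applied to the random variables $H_n := a - b\|X_n\| - \psi_n(X_n)$, yields

$$\lambda = \lim_{n\to\infty} \int \psi_n\,dQ_n = \lim_{n\to\infty} \Bigl( \int (a - b\|x\|)\,dQ_n - \mathbb{E}(H_n) \Bigr)$$

$$\le a - b\gamma - \mathbb{E}\Bigl( \liminf_{n\to\infty} H_n \Bigr)$$

$$\le a - b\gamma - \mathbb{E}(a - b\|X\| - \tilde\psi(X))$$

$$= b\Bigl( \int \|x\|\,Q(dx) - \gamma \Bigr) + \int \tilde\psi(x)\,Q(dx)$$

$$\le b\Bigl( \int \|x\|\,Q(dx) - \gamma \Bigr) + L(Q).$$

Thus $\lambda < L(Q)$ if $\gamma > \int \|x\|\,Q(dx)$.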

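It remains to analyze the case $\gamma = \int \|x\|\,Q(dx) < \infty$. Here $\lambda \le L(\tilde\psi,Q) \le L(Q)$, and it remains to show that $\lambda \ge L(Q)$ which would entail that $\tilde\psi$ equals the unique maximizer $\psi := \psi(\cdot|Q)$. With the approximations $\psi^{(1)} \ge \psi^{(\varepsilon)} \ge \psi$, $0 < \varepsilon \le 1$, introduced in Lemma 4.3, it follows from their Lipschitz-continuity and Lemma 4.5 that $\lambda = \lim_{n\to\infty} L(\psi_n,Q_n)$ is not smaller than

$$\lim_{n\to\infty} L(\psi^{(\varepsilon)},Q_n) = L(\psi^{(\varepsilon)},Q) = \int \psi^{(\varepsilon)}\,dQ - \int \exp(\psi^{(\varepsilon)}(x))\,dx + 1.$$

By monotone convergence, applied to the functions $\psi^{(1)} - \psi^{(\varepsilon)}$, and dominated convergence, applied to $\exp \circ\, \psi^{(\varepsilon)}$,

$$\lambda \ge \lim_{\varepsilon \downarrow 0} L(\psi^{(\varepsilon)},Q) = L(\psi,Q) = L(Q).$$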

Note that the probability densities $f = \exp \circ\, \psi$ and $f_n = \exp \circ\, \psi_n$ obviously satisfy

$$\lim_{n\to\infty,\,x\to y} f_n(x) = f(y) \quad\text{for all } y \in \mathbb{R}^d \setminus \partial\{f > 0\},$$

$$\limsup_{n\to\infty,\,x\to y} f_n(x) \le f(y) \quad\text{for all } y \in \partial\{f > 0\}.$$

In particular, $(f_n)_n$ converges to $f$ almost everywhere w.r.t. Lebesgue measure, whence $\int |f_n(x) - f(x)|\,dx \to 0$.

The only problem is that we established these properties only for a subsequence of the original sequence $(Q_n)_n$. But elementary considerations outlined in [DSS 2010] show that this is sufficient. □



Proof of Theorem 2.15. The assertions of this theorem are essentially covered by Theorem 4.4 as long as $Q \in \mathcal{Q}_o \cap \mathcal{Q}^1$. It only remains to show that $L(Q_n) \to \infty$ if $D_1(Q_n,Q) \to 0$ for some $Q \in \mathcal{Q}^1 \setminus \mathcal{Q}_o$. Thus $\int \|x\|\,Q(dx) < \infty$ and $Q(H) = 1$ for a hyperplane $H = \{x \in \mathbb{R}^d : u^\top x = r\}$ with a unit vector $u \in \mathbb{R}^d$ and some $r \in \mathbb{R}$. For $k \ge 1$ we define $\phi_k \in \Phi$ via

$$\phi_k(x) := -\|a_k + B_k x\| + \log(k),$$

where $B_k := I - uu^\top + k\,uu^\top$ is a real $d \times d$ matrix and $a_k := -(k-1)ru$. Note that $\det(B_k) = k$ and $\phi_k(x) = \log(k) - \|x\|$ for $x \in H$, because $B_k x = x + (k-1)ru$ on $H$. Thus

$$L(\phi_k,Q_n) \to L(\phi_k,Q) = \log(k) - \int \|x\|\,Q(dx) - \int e^{-\|x\|}\,dx + 1.$$

Since the right-hand side may be arbitrarily large, $\lim_{n\to\infty} L(Q_n) = \infty$. □
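The two algebraic properties of the matrix $B_k = I - uu^\top + k\,uu^\top$ used in this proof, namely $\det(B_k) = k$ and the fact that $B_k$ stretches only the direction $u$, can be verified numerically. The sketch below does this in dimension three; the particular unit vector, the value of $k$ and the helper names are hypothetical choices for illustration.

```python
def B_matrix(u, k):
    """Return B_k = I - u u^T + k u u^T = I + (k - 1) u u^T for a unit vector u."""
    d = len(u)
    return [[(1.0 if i == j else 0.0) + (k - 1.0) * u[i] * u[j] for j in range(d)]
            for i in range(d)]

def matvec(B, x):
    """Matrix-vector product B x."""
    return [sum(B[i][j] * x[j] for j in range(len(x))) for i in range(len(B))]

def det3(B):
    """Determinant of a 3 x 3 matrix by cofactor expansion along the first row."""
    return (B[0][0] * (B[1][1] * B[2][2] - B[1][2] * B[2][1])
            - B[0][1] * (B[1][0] * B[2][2] - B[1][2] * B[2][0])
            + B[0][2] * (B[1][0] * B[2][1] - B[1][1] * B[2][0]))

u = [1.0 / 3.0, 2.0 / 3.0, 2.0 / 3.0]    # unit vector
w = [2.0 / 3.0, -2.0 / 3.0, 1.0 / 3.0]   # orthogonal to u
k = 5.0
B = B_matrix(u, k)

print(det3(B))       # close to k
print(matvec(B, u))  # close to k * u
print(matvec(B, w))  # close to w
```

Since $\det(B_k) = k$, the change of variables $z = a_k + B_k x$ gives $\int e^{\phi_k(x)}\,dx = \int e^{-\|z\|}\,dz$ for every $k$, so only the term $\log(k)$ grows in $L(\phi_k, Q)$.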

Proof of Theorem 3.2. Note that $v \mapsto Q_v$ defines a continuous mapping from $\mathbb{R}^n$ into the space of probability distributions on $\mathbb{R}$ with finite first moment, equipped with Mallows distance $D_1$. Moreover, by our assumption that $Y \notin \mathcal{M}(\mathbf{x})$, none of the distributions $Q_{m(\mathbf{x})}$, $m \in \mathcal{M}$, degenerates to a Dirac measure. According to Theorem 2.15, the mapping $v \mapsto L(Q_v)$ is thus continuous from $\mathcal{M}(\mathbf{x})$ into $\mathbb{R}$.

When proving existence of a maximizer, as explained in Section 3.1, we may restrict our attention to the closed subset $\mathcal{M}(\mathbf{x},Y) := \{v \in \mathcal{M}(\mathbf{x}) : \bar v = \bar Y\}$ of $\mathcal{M}(\mathbf{x})$, where generally $\bar w$ denotes the arithmetic mean $n^{-1} \sum_{i=1}^n w_i$ for a vector $w \in \mathbb{R}^n$. But for $v \in \mathcal{M}(\mathbf{x},Y)$,

$$\int |x - \mu(Q_v)|\,Q_v(dx) = \frac{1}{n} \sum_{i=1}^n |Y_i - v_i| \ge \frac{1}{n} \sum_{i=1}^n |v_i| - \frac{1}{n} \sum_{i=1}^n |Y_i|,$$

and the right-hand side tends to infinity as $\|v\| \to \infty$. Thus it follows from Lemma 3.1 that

$$L(Q_v) \to -\infty \quad\text{as } \|v\| \to \infty,\ v \in \mathcal{M}(\mathbf{x},Y),$$

and this coercivity, combined with continuity of $v \mapsto L(Q_v)$ and $\mathcal{M}(\mathbf{x},Y)$ being closed, yields the existence of a maximizer. □
Proof of Theorem 3.5. The proof that $Q * R \in \mathcal{Q}_o \cap \mathcal{Q}^1$ is elementary and omitted here. By affine equivariance (Remark 2.4), we may and do assume that $\int y\,R(dy) = 0$. Now let $\psi := \psi(\cdot|Q)$ and $\tilde\psi := \psi(\cdot|Q * R)$. Then

$$L(Q * R) = \int\!\!\int \tilde\psi(x+y)\,Q(dx)\,R(dy) = \int \tilde\psi_R\,dQ,$$

where

$$\tilde\psi_R(x) := \int \tilde\psi(x+y)\,R(dy) \le \tilde\psi(x)$$

by Jensen's inequality. Hence

$$L(Q * R) \le \int \tilde\psi\,dQ = L(\tilde\psi,Q) \le L(Q).$$

Now suppose that $L(Q * R) = L(Q)$, so in particular, $\tilde\psi = \psi$. It follows from $\tilde\psi_R \le \tilde\psi \in \Phi$ and Fatou's lemma that $\tilde\psi_R \in \Phi$ with $\int \exp(\tilde\psi_R(x))\,dx \le \int \exp(\tilde\psi(x))\,dx = 1$. Thus

$$L(Q) = L(Q * R) \le L(\tilde\psi_R,Q) \le L(Q),$$

that is, $\tilde\psi_R = \tilde\psi = \psi$ and

(10) $\psi(x) = \int \psi(x+y)\,R(dy)$ for all $x \in \mathbb{R}^d$.

It remains to be shown that (10) entails $R = \delta_0$. Note that $K := \{x \in \mathbb{R}^d : \psi(x) = M_o\}$ with $M_o := \max_{y \in \mathbb{R}^d} \psi(y)$ defines a compact set. Hence for any unit vector $u \in \mathbb{R}^d$ there exists a vector $x(u) \in K$ such that $u^\top x(u) \ge u^\top x$ for all $x \in K$. But then $\psi(x(u) + y) < M_o$ for all $y \in \mathbb{R}^d$ with $u^\top y > 0$. Hence

$$M_o = \psi(x(u)) = \int \psi(x(u) + y)\,R(dy)$$

implies that $R\{y : u^\top y > 0\} = 0$. Since $u$ is an arbitrary unit vector, this entails that $\mathrm{csupp}(R) = \{0\}$, that is, $R = \delta_0$. □

Proof of Theorem 3.6. Assumptions (A.2) and (A.3) imply that the empirical distribution $\hat Q_n := \hat Q_{n,\mu_n}$ of the true errors $\varepsilon_{ni}$ satisfies both $D_{BL}(\hat Q_n,Q) \to_p 0$ and $\int |t|\,\hat Q_n(dt) \to_p \int |t|\,Q(dt)$. Thus $D_1(\hat Q_n,Q) \to_p 0$.

To verify the assertions of the theorem it suffices to consider a sequence of fixed vectors $\varepsilon_n = (\varepsilon_{ni})_{i=1}^n \in \mathbb{R}^n$ such that for a constant $c > 0$ to be specified later,

(11) $D_1(\hat Q_n,Q) + \sup_{m \in \mathcal{M}_n :\, \|(m - \mu_n)(\mathbf{x}_n)\|_n \le c} D_{BL}(\hat Q_{n,m}, Q_{n,m}) \to 0.$

Our goal is to show that $(\hat f_n, \hat\mu_n)$, viewed as a function of $\varepsilon_n$ and thus fixed, too, is well defined for sufficiently large $n$ with

(12) $\int |\hat f_n(x) - f(x)|\,dx \to 0$ and $\|(\hat\mu_n - \mu_n)(\mathbf{x}_n)\|_n \to 0.$

Note that we replaced $f_n$ with $f = \exp \circ\, \psi(\cdot|Q)$ because $\int |f_n(x) - f(x)|\,dx$ tends to 0.

We know already that we have to restrict our attention to the set $\tilde{\mathcal{M}}_n$ of all $m \in \mathcal{M}_n$ such that $\int t\,\hat Q_{n,m}(dt) = 0$, that is, $\int t\,R_{(\mu_n - m)(\mathbf{x}_n)}(dt) = -\int t\,\hat Q_n(dt)$ converges to 0. Since $\{m(\mathbf{x}_n) : m \in \tilde{\mathcal{M}}_n\}$ is a closed subset of $\mathbb{R}^n$ by (A.1), we may argue as in the proof of Theorem 3.2 that a maximizer $\hat\mu_n$ of $L(\hat Q_{n,m})$ over all $m \in \tilde{\mathcal{M}}_n$ does exist. It is possible that $L(\hat Q_{n,\hat\mu_n}) = \infty$, but if we can show that $D_1(\hat Q_{n,\hat\mu_n},Q) \to 0$, then $\hat f_n$ exists for sufficiently large $n$, too. Thus we may rephrase (12) as

(13) $D_1(M_n,Q) \to 0$ and $\int |t|\,R_n(dt) \to 0,$

where $M_n := \hat Q_{n,\hat\mu_n}$ and $R_n := R_{(\mu_n - \hat\mu_n)(\mathbf{x}_n)}$.

Note first that $\tilde\mu_n := \mu_n + \int t\,\hat Q_n(dt)$ belongs to $\tilde{\mathcal{M}}_n$, whence

(14) $L(M_n) \ge L(\hat Q_{n,\tilde\mu_n}) = L(\hat Q_n) \to L(Q)$

by Theorem 2.15. On the other hand

$$\int |t|\,M_n(dt) = \frac{1}{n} \sum_{i=1}^n |\varepsilon_{ni} + (\mu_n - \hat\mu_n)(x_{ni})| \ge \int |t|\,R_n(dt) - \int |t|\,\hat Q_n(dt).$$

Thus, by Lemma 3.1, $\hat\mu_n$ satisfies $\int |t|\,R_n(dt) < c$ for sufficiently large $n \in \mathbb{N}$, provided that $c$ is larger than $\int |t|\,Q(dt) + \exp(-L(Q))$. In particular,

$$D_{BL}(M_n, Q_n * R_n) = D_{BL}(\hat Q_{n,\hat\mu_n}, Q_{n,\hat\mu_n}) \to 0.$$

Since $D_{BL}(Q_n * R_n, Q * R_n) \le D_{BL}(Q_n,Q) \to 0$, we know that even

$$D_{BL}(M_n, Q * R_n) \to 0.$$

Since $(R_n)_n$ is tight, to verify (13) we may consider a subsequence $(R_{n(k)})_k$ that converges weakly to some distribution $R$ as $k \to \infty$. Then $M_{n(k)} \to_w Q * R$, so

$$\limsup_{k\to\infty} L(M_{n(k)}) \le L(Q * R) \le L(Q)$$

by Theorems 2.14 and 3.5. Because of (14) we even know that $L(M_{n(k)}) \to L(Q * R) = L(Q)$ as $k \to \infty$. Consequently, we may deduce from Theorems 2.14 and 3.5 that

$$\lim_{k\to\infty} D_1(M_{n(k)}, Q * R) = 0 \quad\text{and}\quad R = \delta_a \text{ for some } a \in \mathbb{R}.$$
It remains to be shown that $a = 0$ and $\lim_{k\to\infty} \int |t|\,R_{n(k)}(dt) = 0$. Elementary arguments reveal that for arbitrary $r > 0$ and $n \in \mathbb{N}$,

$$\int |t|\,M_n(dt) \ge \int \min(|t|,r)\,M_n(dt) + \int |t|\,R_n(dt) - \int \min(|t|,2r)\,R_n(dt) - \int (|t| - r)^+\,\hat Q_n(dt).$$

Hence $\int |t|\,R_{n(k)}(dt)$ is not greater than

$$\int \min(|t|,2r)\,R_{n(k)}(dt) + \int (|t| - r)^+\,M_{n(k)}(dt) + \int (|t| - r)^+\,\hat Q_{n(k)}(dt)$$

$$\to \int \min(|t|,2r)\,R(dt) + \int (|t| - r)^+\,Q * R(dt) + \int (|t| - r)^+\,Q(dt)$$

as $k \to \infty$. As $r \uparrow \infty$, the limit on the right-hand side converges to $\int |t|\,R(dt) = |a|$. Consequently, $\lim_{k\to\infty} D_1(R_{n(k)},R) = 0$. But then $0 = \lim_{k\to\infty} \int t\,R_{n(k)}(dt)$ coincides with $\int t\,R(dt) = a$. □

In our proofs of Theorems 3.7 and 3.8 we utilize a simple inequality for the bounded Lipschitz distance in terms of the Kolmogorov-Smirnov distance,

$$D_{KS}(Q,Q') := \sup_{t \in \mathbb{R}} |(Q' - Q)((-\infty,t])|,$$

of two distributions $Q, Q' \in \mathcal{Q}(1)$:

Lemma 4.6 (DSS 2010). Let $Q$ and $Q'$ be distributions on the real line. Then for arbitrary $r > 0$,

$$D_{BL}(Q,Q') \le 4\,Q(\mathbb{R} \setminus (-r,r]) + 4(r+1)\,D_{KS}(Q,Q').$$
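The quantities in Lemma 4.6 are easy to compute for discrete distributions. The sketch below evaluates $D_{KS}$ and the right-hand side of the lemma for two uniform distributions on finitely many atoms, and compares the bound with $|\int h\,d(Q - Q')|$ for one test function $h$ with sup-norm and Lipschitz constant both equal to $1/2$. The definition of $D_{BL}$ is not restated in this section, so this single $h$ only gives a lower bound for $D_{BL}$ under the usual normalizations; the atoms, the choice of $h$ and all names are illustrative assumptions.

```python
def cdf(atoms, t):
    """Distribution function at t of the uniform distribution on the list `atoms`."""
    return sum(1 for a in atoms if a <= t) / len(atoms)

def ks_distance(atoms_q, atoms_q2):
    """sup_t |(Q' - Q)((-inf, t])|; for discrete distributions it is attained at an atom."""
    return max(abs(cdf(atoms_q2, t) - cdf(atoms_q, t)) for t in atoms_q + atoms_q2)

def lemma46_bound(atoms_q, atoms_q2, r):
    """Right-hand side 4 Q(R \\ (-r, r]) + 4 (r + 1) D_KS(Q, Q')."""
    tail = sum(1 for a in atoms_q if not (-r < a <= r)) / len(atoms_q)
    return 4.0 * tail + 4.0 * (r + 1.0) * ks_distance(atoms_q, atoms_q2)

def h(t):
    """Bounded Lipschitz test function: sup-norm 1/2, Lipschitz constant 1/2."""
    return 0.5 * max(-1.0, min(1.0, t))

def integral_gap(atoms_q, atoms_q2):
    """|int h dQ - int h dQ'| for the uniform distributions on the two atom lists."""
    mean_h = lambda atoms: sum(h(a) for a in atoms) / len(atoms)
    return abs(mean_h(atoms_q) - mean_h(atoms_q2))

Q_atoms = [-1.0, 0.0, 1.0, 2.0]
Q2_atoms = [-0.9, 0.1, 1.1, 2.1]

print(ks_distance(Q_atoms, Q2_atoms))
print(integral_gap(Q_atoms, Q2_atoms), lemma46_bound(Q_atoms, Q2_atoms, 3.0))
```

A small shift of the atoms already produces a Kolmogorov-Smirnov distance of one atom weight, while the integral of a Lipschitz function changes little, which is why the lemma bounds $D_{BL}$ by $D_{KS}$ and not the other way round.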

Proof of Theorem 3.7. A key insight is that the empirical distributions $\hat Q_{n,m}$ are close to their expectations $Q_{n,m}$ with respect to Kolmogorov-Smirnov distance, uniformly over all $m \in \mathcal{M}_n$. Namely,

$$\sup_{m \in \mathcal{M}_n,\, r \in \mathbb{R}} |(\hat Q_{n,m} - Q_{n,m})((-\infty,r])| = \sup_{b,\,s} \Bigl| \frac{1}{n} \sum_{i=1}^n \bigl(1\{Y_{ni} - b^\top x_{ni} \le s\} - \Pr(Y_{ni} - b^\top x_{ni} \le s)\bigr) \Bigr|$$

$$\le \sup_{H \in \mathcal{H}_n} |(\hat M_n - M_n)(H)|,$$

where $\mathcal{H}_n$ denotes the family of all closed half-spaces in $\mathbb{R}^{q(n)+1}$ while $\hat M_n$ is the empirical distribution of the random vectors $(Y_{ni}, x_{ni}^\top)^\top \in \mathbb{R}^{q(n)+1}$, $1 \le i \le n$, and $M_n := \mathbb{E}\hat M_n$. Now we utilize well-known results from empirical process theory: $\mathcal{H}_n$ is a Vapnik-Červonenkis class with VC-dimension $q(n) + 3$, and $\hat M_n$ is the arithmetic mean of $n$ independent random probability measures. Thus



E SUp DKs{Qn,m-,Qn,m 
m£M n 

for some universal constant C [see Pollard (1990), Theorems 2.2 and 3.5, 
and van der Vaart and Wellner (1996), Theorem 2.6.4 and Lemma 2.6.16]. 

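Since for fixed $c > 0$ the family $\{Q_{n,m} : n \in \mathbb{N},\ m \in \mathcal{M}_n \text{ with } \|(m - \mu_n)(\mathbf{x}_n)\|_n \le c\}$ is tight, the previous finding, combined with Lemma 4.6, implies that

$$\lim_{n\to\infty} \mathbb{E} \sup_{m \in \mathcal{M}_n :\, \|(m - \mu_n)(\mathbf{x}_n)\|_n \le c} D_{BL}(\hat Q_{n,m}, Q_{n,m}) = 0.$$ □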